n-Grams and the UN Taxonomy

This blog post is an attempt to present ideas that are a bit half-baked, in an effort to appear more prolific than I might otherwise. In this case I combine a set of keywords compiled by the United Nations with the Google n-Grams dataset to see how often these important terms have been used over time in each of the official UN languages.

The UN Library has a hierarchical taxonomy of about 7,000 terms that it uses to index its documents. It includes terms like ‘maritime law’, ‘peacemaking’ and ‘semitic languages’ as well as more obscure ones such as ‘persistent organic pollutants’. It’s a very rich resource, as all the terms are translated into each of the six UN languages (English, Spanish, Arabic, Russian, Chinese and French). What’s more, semantic relations are provided: each term has related terms, broader terms and more specific terms. For example, ‘immigration law’ links to ‘law’, ‘deportation’ and so on.

The first challenge was to make something useful from a webpage that looks like this:

Example term in the UNBIS Thesaurus

The first thing I did was to scrape everything from the site so we have the full taxonomy and all the relations. There is clearly a structured database behind this site but like much information housed by public organisations, there is no way to access it through an API or data dump. So I hacked together a scraper using Python’s excellent Beautiful Soup module (a slightly messy GitHub repo is here in case you’re interested, and here is the full ontology in a JSON file if you want to cut to the chase). The final data structure is a nested dictionary of the form

relationsHash['ES']['INDUSTRIA METALMECANICA']['Broader terms'] = ['INDUSTRIA PESADA']
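To make that shape concrete, here is a minimal sketch of how such a nested dictionary could be built and walked. It is not the scraper’s actual code: the helper names (make_relations, add_relation, broader_chain) are made up for illustration, and the only data in it is the single Spanish entry shown above.

```python
from collections import defaultdict


def make_relations():
    # language code -> term -> relation type -> list of related terms
    return defaultdict(lambda: defaultdict(lambda: defaultdict(list)))


def add_relation(relations, lang, term, relation, target):
    """Record one relation, ignoring exact duplicates."""
    targets = relations[lang][term][relation]
    if target not in targets:
        targets.append(target)


def broader_chain(relations, lang, term):
    """Follow the first 'Broader terms' link upwards until there is none."""
    chain = [term]
    while True:
        # .get() is used so that lookups do not create empty entries
        parents = (
            relations.get(lang, {})
            .get(chain[-1], {})
            .get('Broader terms', [])
        )
        # Stop at the top of the hierarchy, or if a link loops back
        if not parents or parents[0] in chain:
            return chain
        chain.append(parents[0])


relationsHash = make_relations()
add_relation(relationsHash, 'ES', 'INDUSTRIA METALMECANICA',
             'Broader terms', 'INDUSTRIA PESADA')

chain = broader_chain(relationsHash, 'ES', 'INDUSTRIA METALMECANICA')
print(chain)
```

Keying on language first keeps each language’s hierarchy separate, which is handy later when the terms are looked up one language at a time.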

Then I thought it would be interesting to see how often these UN terms are used: is the UN’s diplomatic language becoming less relevant? Or is it becoming more relevant in some linguistic regions of the world? To compare, I plugged these terms into the n-grams viewer (unfortunately it doesn’t have Arabic, so I had to drop that). One of my favourites is printed below; it shows that the rate at which patents are mentioned in written text is plateauing in English, French and Spanish. But we see a huge uptick more recently in Chinese, presumably attributable to growth in the Chinese knowledge economy.

Trends in the rate of use of the word ‘patents’ over time in English, Chinese, French, Russian and Spanish 1920–2010

I also hosted a ‘live’ version here. It hits the n-grams server quite hard to fetch the embeddable widgets, so sometimes not all of them appear. Likewise, if the volume for a term is very low, its graph doesn’t work. I also tried to include related terms on the same plot to see if synonyms behaved in the same way; an unreliable page showing this is here.
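For anyone wanting to try something similar, a page like this needs one viewer URL per term and language. Below is a rough sketch of building such a URL. The parameter names (content, year_start, year_end, corpus, smoothing) are what I understand the viewer’s own graph URLs to use, so treat them as an assumption and check them against a URL copied from the viewer. The corpus value is a placeholder, not a real identifier, and ngram_url is a made-up helper name.

```python
from urllib.parse import urlencode

# Assumed base URL of the n-grams viewer graph page
NGRAM_GRAPH_URL = 'https://books.google.com/ngrams/graph'


def ngram_url(phrase, corpus, year_start=1920, year_end=2010, smoothing=3):
    """Build a viewer URL for one phrase in one language corpus."""
    params = {
        'content': phrase,
        'year_start': year_start,
        'year_end': year_end,
        'corpus': corpus,  # placeholder: copy the real value from the viewer
        'smoothing': smoothing,
    }
    # urlencode escapes spaces and non-ASCII characters in the phrase
    return NGRAM_GRAPH_URL + '?' + urlencode(params)


url = ngram_url('maritime law', corpus='CORPUS_ID')
print(url)
```

Spacing the requests out, rather than loading every widget at once, would probably help with the missing graphs mentioned above.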

Given more time and energy, a nice extension of this project would be an investigation into how diplomacy-related language is used differently across language regions of the world and over time. There could be a comparison between books (n-grams), news (GDELT) and social media (Twitter, Facebook etc.). In addition, we know that changing norms and epistemological trends affect language, so we might see spikes in this kind of language before global policy pushes such as the Convention on the Rights of the Child or the Convention on Human Rights.

Data, science, data science and trace amounts of the Middle East and the UN
