Working with data about political units – countries, states, counties, communes, provinces, regions, etc. – is a bit of a nightmare. The names of the units frequently have variations, so one has to either be clever or spend a lot of time manually matching them up. Such work is error-prone and demotivating, so we want to be smart instead. The solution is standardization. However, since we cannot get people to agree to use the same spellings of everything, especially people who speak different languages, we have to go to the next level: abbreviations. There are already ISO abbreviations for political units (ISO 3166), but I’m not aware of any efficient way to use them in R. So I developed my own method. It consists of two things: a large dataset of names of political units and their matching abbreviations (currently based on ISO 3166-1 alpha-3, though I should switch to ISO 3166-2, which already covers all first-level administrative divisions of every country in the world, N=3600), and a clever function for working with them. Let’s see some examples:
#try the Argentinian provinces
v_argentinian_provinces
## [1] "Buenos Aires" "Buenos Aires City (DC)"
## [3] "Catamarca" "Chaco"
## [5] "Chubut" "Córdoba"
## [7] "Corrientes" "Entre Ríos"
## [9] "Formosa" "Jujuy"
## [11] "La Pampa" "La Rioja"
## [13] "Mendoza" "Misiones"
## [15] "Neuquén" "Río Negro"
## [17] "Salta" "San Juan"
## [19] "San Luis" "Santa Cruz"
## [21] "Santa Fe" "Santiago del Estero"
## [23] "Tierra del Fuego" "Tucumán"
pu_translate(v_argentinian_provinces, messages = 2)
## Exact match: Buenos Aires -> ARG_B
## Exact match: Buenos Aires City (DC) -> ARG_C
## Exact match: Catamarca -> ARG_K
## Exact match: Chaco -> ARG_H
## Exact match: Chubut -> ARG_U
## Error: More than one exact match for Córdoba. Perhaps you want to use a superunit to limit the options? Matches were: COL_COR | ARG_X
Here we get an error because the name we are trying to translate to an abbreviation matches multiple candidates and the algorithm doesn’t know which one to pick. But since we know that these are all Argentinian units, we can limit the search to subunits of Argentina:
#limit to Argentinian political units
pu_translate(v_argentinian_provinces, messages = 2, superunit = "ARG")
## Subsetting to subunits of Argentina [ARG].
## Exact match: Buenos Aires -> ARG_B
## Exact match: Buenos Aires City (DC) -> ARG_C
## Exact match: Catamarca -> ARG_K
## Exact match: Chaco -> ARG_H
## Exact match: Chubut -> ARG_U
## Exact match: Córdoba -> ARG_X
## Exact match: Corrientes -> ARG_W
## Exact match: Entre Ríos -> ARG_E
## Exact match: Formosa -> ARG_P
## Exact match: Jujuy -> ARG_Y
## Exact match: La Pampa -> ARG_L
## Exact match: La Rioja -> ARG_F
## Exact match: Mendoza -> ARG_M
## Exact match: Misiones -> ARG_N
## Exact match: Neuquén -> ARG_Q
## Exact match: Río Negro -> ARG_R
## Exact match: Salta -> ARG_A
## Exact match: San Juan -> ARG_J
## Exact match: San Luis -> ARG_D
## Exact match: Santa Cruz -> ARG_Z
## Exact match: Santa Fe -> ARG_S
## Exact match: Santiago del Estero -> ARG_G
## Exact match: Tierra del Fuego -> ARG_V
## Exact match: Tucumán -> ARG_T
## Buenos Aires Buenos Aires City (DC) Catamarca
## "ARG_B" "ARG_C" "ARG_K"
## Chaco Chubut Córdoba
## "ARG_H" "ARG_U" "ARG_X"
## Corrientes Entre Ríos Formosa
## "ARG_W" "ARG_E" "ARG_P"
## Jujuy La Pampa La Rioja
## "ARG_Y" "ARG_L" "ARG_F"
## Mendoza Misiones Neuquén
## "ARG_M" "ARG_N" "ARG_Q"
## Río Negro Salta San Juan
## "ARG_R" "ARG_A" "ARG_J"
## San Luis Santa Cruz Santa Fe
## "ARG_D" "ARG_Z" "ARG_S"
## Santiago del Estero Tierra del Fuego Tucumán
## "ARG_G" "ARG_V" "ARG_T"
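Because the abbreviations encode their superunit as a prefix (e.g. ARG_X is a subunit of ARG), the superunit subsetting can be pictured as a simple prefix match. A minimal sketch of the idea (toy data; the function name here is made up and not part of the package):

```r
# Toy illustration: subset candidate abbreviations to those under a superunit.
# Abbreviations follow the pattern SUPERUNIT_SUBUNIT, e.g. "ARG_X".
v_candidates = c("COL_COR", "ARG_X", "ARG_B", "MEX_COA")

subset_to_superunit = function(abbrevs, superunit) {
  #keep only abbreviations whose prefix is the superunit's code
  abbrevs[startsWith(abbrevs, paste0(superunit, "_"))]
}

subset_to_superunit(v_candidates, "ARG")
# "ARG_X" "ARG_B"
```

With the candidate pool reduced to Argentinian subunits, the Córdoba ambiguity from before disappears.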
And we get the right results. What about translating abbreviations back into names? We can do that too. There is support for multiple languages; by default the top name for an abbreviation in the dataset is used, which is the preferred English one. If there is no variant in the requested language, it falls back to the English one. In many cases, these are the same:
#translate back names
v_abbrevs = c("DNK", "THA", "USA", "GBR", "SWE", "NOR", "DEU", "VEN", "TUR", "NLD")
#translate back in English (default)
pu_translate(v_abbrevs, messages = 2, reverse = T)
## DNK THA USA GBR
## "Denmark" "Thailand" "USA" "United Kingdom"
## SWE NOR DEU VEN
## "Sweden" "Norway" "Germany" "Venezuela"
## TUR NLD
## "Turkey" "Netherlands"
#translate back in Danish
pu_translate(v_abbrevs, messages = 2, reverse = T, lang = "da")
## There was no match in language da for THA
## There was no match in language da for USA
## There was no match in language da for VEN
## DNK THA USA GBR
## "Danmark" "Thailand" "USA" "Storbritannien"
## SWE NOR DEU VEN
## "Sverige" "Norge" "Tyskland" "Venezuela"
## TUR NLD
## "Tyrkiet" "Holland"
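The reverse lookup with language fallback can be pictured as a table with one row per (abbreviation, language) pair. A toy sketch of the idea (the data frame and function names are made up for illustration, not the package's actual internals):

```r
# Toy illustration: abbreviation -> name lookup with fallback to English.
d_names = data.frame(
  abbrev = c("DNK", "DNK", "GBR", "GBR", "THA"),
  lang   = c("en",  "da",  "en",  "da",  "en"),
  name   = c("Denmark", "Danmark", "United Kingdom", "Storbritannien", "Thailand"),
  stringsAsFactors = FALSE
)

toy_reverse = function(abbrevs, lang = "en") {
  #look up names in the requested language
  d_l = d_names[d_names$lang == lang, ]
  out = d_l$name[match(abbrevs, d_l$abbrev)]
  #fall back to English where the requested language has no entry
  d_en = d_names[d_names$lang == "en", ]
  miss = is.na(out)
  out[miss] = d_en$name[match(abbrevs[miss], d_en$abbrev)]
  out
}

toy_reverse(c("DNK", "THA"), lang = "da")
# "Danmark" "Thailand"
```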
One can combine the two methods to standardize names, i.e. change all name variants to the standard spelling. Here I’ve mixed Danish, Italian and some less common English variants:
#standardize names
v_odd_names = c("Danmark", "Viet Nam", "Belgia", "Bolivia, Plurinational State of")
pu_translate(v_odd_names, messages = 2, standardize_name = T)
## Exact match: Danmark -> DNK
## Exact match: Viet Nam -> VNM
## Exact match: Belgia -> BEL
## Exact match: Bolivia, Plurinational State of -> BOL
## Danmark Viet Nam
## "Denmark" "Vietnam"
## Belgia Bolivia, Plurinational State of
## "Belgium" "Bolivia"
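Conceptually, standardizing is just the two translations chained: variant name -> abbreviation -> preferred name. A toy sketch under that assumption (made-up data and function names, not the package's code):

```r
# Toy illustration: standardize a name by going variant -> abbrev -> preferred name.
d_units = data.frame(
  name      = c("Denmark", "Danmark", "Vietnam", "Viet Nam"),
  abbrev    = c("DNK",     "DNK",     "VNM",     "VNM"),
  preferred = c(TRUE,      FALSE,     TRUE,      FALSE),
  stringsAsFactors = FALSE
)

toy_standardize = function(x) {
  #forward step: any name variant -> abbreviation
  abbrevs = d_units$abbrev[match(x, d_units$name)]
  #reverse step: abbreviation -> the preferred (standard) name
  d_pref = d_units[d_units$preferred, ]
  d_pref$name[match(abbrevs, d_pref$abbrev)]
}

toy_standardize(c("Danmark", "Viet Nam"))
# "Denmark" "Vietnam"
```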
What if there is no exact match in the dataset? We fall back to fuzzy matching using the stringdist package. Let’s try some almost-right names:
v_notquiteright = c("USa", "Dnmark", "Viitnam", "Bolgium", "Boliviam")
pu_translate(v_notquiteright, messages = 2)
## No exact match: USa
## No exact match: Dnmark
## No exact match: Viitnam
## No exact match: Bolgium
## No exact match: Boliviam
## Best fuzzy match found: USa -> USA with distance 1.00
## Best fuzzy match found: Dnmark -> Denmark with distance 1.00
## Best fuzzy match found: Viitnam -> Vietnam with distance 1.00
## Best fuzzy match found: Bolgium -> Belgium with distance 1.00
## Best fuzzy match found: Boliviam -> Bolivia with distance 1.00
## USa Dnmark Viitnam Bolgium Boliviam
## "USA" "DNK" "VNM" "BEL" "BOL"
pu_translate(v_notquiteright, messages = 2, standardize_name = T)
## No exact match: USa
## No exact match: Dnmark
## No exact match: Viitnam
## No exact match: Bolgium
## No exact match: Boliviam
## Best fuzzy match found: USa -> USA with distance 1.00
## Best fuzzy match found: Dnmark -> Denmark with distance 1.00
## Best fuzzy match found: Viitnam -> Vietnam with distance 1.00
## Best fuzzy match found: Bolgium -> Belgium with distance 1.00
## Best fuzzy match found: Boliviam -> Bolivia with distance 1.00
## USa Dnmark Viitnam Bolgium Boliviam
## "USA" "Denmark" "Vietnam" "Belgium" "Bolivia"
And in each case we manage to get the right one. What happens if we punch our keyboards at random and try to get matches?
v_totallywrong = c("jinsa", "aska", "bombom!")
pu_translate(v_totallywrong, messages = 2)
## No exact match: jinsa
## No exact match: aska
## No exact match: bombom!
## Best fuzzy match found: jinsa -> Cina with distance 2.00
## Best fuzzy match found: aska -> Alaska with distance 2.00
## Error: There were more than one equally good matches that did not agree for bombom!: Roraima | Cambodia | Cambodja | Cambogia | Cocoa | Tolima | Comoros | Congo | Kongo | Gabon | Gambia | Kosovo | Colima | Sonora | Moldova | Romania | Samoa | Sao Tome | Somalia | Togo | Alabama | Zambia | Formosa. All with distance 5.00
So even random strings can produce sensible matches. The function looks for trouble, though: if there is more than one equally distant match and they disagree about which abbreviation to pick, we get an error instead of an arbitrary choice.
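The fuzzy fallback can be sketched with the stringdist package directly (a toy illustration with made-up data and function names; the real function’s distance method and settings may differ):

```r
library(stringdist)  # approximate string matching (CRAN package)

v_names = c("Denmark", "Vietnam", "Belgium", "Bolivia")

#find the closest candidate by edit distance; error on disagreeing ties
toy_fuzzy_match = function(x, candidates = v_names) {
  d = stringdist(x, candidates)  # default method: optimal string alignment
  best = candidates[d == min(d)]
  if (length(unique(best)) > 1) {
    stop("More than one equally good match for ", x, ": ",
         paste(best, collapse = " | "))
  }
  best[1]
}

toy_fuzzy_match("Dnmark")
# "Denmark"
```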