Lexical Distance, a hoax?

Recently Boban Arsenijevic took the Lexical Distance Diagram apart and raised some fair criticisms of it in the blog post “Lexical Distance Among the Languages of Europe: a hoax, a political pamphlet or actual popular science?”

He finishes off with:

I think that it is important that according to my little investigation – unless someone manages to find the specification of its methodology – which I doubt will happen, and this methodology proves adequate – the chart should be treated as a hoax with a political agenda. But its idea remains beautiful, and I would like to see a chart of the same kind, plotted from the real linguistic data.

Similar criticism has been voiced elsewhere. I will attempt to sway Arsenijevic’s opinion and clarify some things.

I am sure that Prof. Tyshchenko’s research was an enormous amount of work, considering that most, if not all, of it was done manually in the 70s and 80s without electronic dictionaries as a resource. I have attempted to reach out to Prof. Tyshchenko to confirm the methodology, regrettably with no success.

Let’s be clear about what this shows and what it does not show. It compares written language, not spoken language. It has nothing to do with grammar, syntax, rhythm or other features that are important for intelligibility. It also compares a small list of words, not the entire vocabulary of one language to that of another. Many letters are pronounced differently in different languages (e.g. English V has the same pronunciation as German W, and English Y as German J). There are words pronounced the same in two different languages that are spelt differently (Arsenijevic’s example of Russian раздел and Ukrainian роздiл, or English fish and German Fisch). There are words that retained letters in their spelling which are now silent in one language but pronounced in another (e.g. English knife and Swedish Kniv, or English knot and German Knoten). There are even false silent letters added to suggest a closer linguistic connection (e.g. the S in English island, after the Latin insula), and cases where the spelling of one language has had a disproportionate influence on another (e.g. Bulgarian on Russian, or Latin on many of the Romance languages). All of these lexical peculiarities have skewed the lexical distance between languages away from the pronunciation distance between them.

I would argue it is possible and interesting to compare dictionaries and the spelling in different languages and then come up with a “lexical” distance diagram. If we were to compare the pronunciation of lists of words to each other, for example by determining an IPA form for each word, then we would be able to produce a “pronunciation” distance diagram! That would be more interesting than the lexical diagram, but no one has done that work yet, and thus we are stuck with what we have.
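To make the difference between a “lexical” and a “pronunciation” distance concrete, here is a minimal Python sketch. It assumes plain Levenshtein edit distance as the measure, and the broad IPA transcriptions are my own illustrative assumption, not data from the diagram.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions and substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


# Spelling comparison: English "fish" against German "Fisch" (lowercased).
spelling_distance = levenshtein("fish", "Fisch".lower())

# Pronunciation comparison: assumed broad IPA forms for the same two words.
ipa_distance = levenshtein("fɪʃ", "fɪʃ")

print(spelling_distance, ipa_distance)
```

The spelling differs by one inserted letter, while the assumed IPA forms are identical, which is exactly the kind of gap a pronunciation diagram would close.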

A common misunderstanding of the Lexical Distance Diagram is that a line between two languages means they are more closely related to each other than to languages they have no line to. A missing line between two languages does not mean that there is no link between them; it just means that the lexical distance between these two languages has not been researched yet. Also, because the connections were compiled first and the languages were then placed according to their researched connections, some languages are placed in a different location than if more of their connections had been determined (for example, Occitan would be closer to Catalan).

My best guess at the Methodology.

Prof. Tyshchenko probably determined each Lexical Distance (LD) between language pairs manually, using Levenshtein Edit Distance on the 15-word Dolgopolsky list or the 100-word or 207-word Swadesh list. For example, comparing Bulgarian жена (woman) to Russian женщина, the insertion of щ, и, н would give you 3 LD. Bulgarian жена to Ukrainian жінка, with е → і and the insertion of к, would give you 2 LD. Bulgarian година (year) to Russian год is 3 LD, and Bulgarian година to Ukrainian рік is 6 LD. Maybe Tyshchenko compared about 15 words and then added up the LDs, or compared 207 words without double-counting letter replacements, insertions and deletions.
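The word pairs above can be checked with a short Python sketch of Levenshtein edit distance. This is only my reconstruction of the guessed method, not Tyshchenko’s confirmed procedure.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-letter insertions, deletions and
    replacements needed to turn word a into word b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # replacement or match
        previous = current
    return previous[-1]


# The example pairs from the text: (Bulgarian word, word in the other language).
pairs = [
    ("жена", "женщина"),  # Bulgarian to Russian
    ("жена", "жінка"),    # Bulgarian to Ukrainian
    ("година", "год"),    # Bulgarian to Russian
    ("година", "рік"),    # Bulgarian to Ukrainian
]

for source, target in pairs:
    print(source, target, levenshtein(source, target))
```

The distance is symmetric, so it does not matter whether a difference is read as an insertion in one direction or a deletion in the other.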

I believe my guess is not too far off. If I cannot confirm it with Tyshchenko, then maybe I can recreate it. I calculated the Levenshtein Distance for a set of Germanic languages using 13 words. This is the result; the columns follow the same order as the rows.

 |00|13|29|30|30|28|26|26|32|40|33|34|39|37|38|42|38|36|37 English
 |13|00|33|31|34|31|26|30|33|40|36|39|42|37|42|44|41|42|36 Scots
 |29|33|00|07|15|15|28|26|25|35|29|33|38|33|37|37|39|37|41 Dutch
 |30|31|07|00|17|19|30|25|23|34|31|34|34|31|36|37|35|33|40 Afrikaans
 |30|34|15|17|00|22|30|24|21|36|24|30|38|34|37|37|37|32|40 Low Saxon
 |28|31|15|19|22|00|28|31|30|33|28|34|37|34|40|39|38|38|44 Limburgs
 |26|26|28|30|30|28|00|26|27|39|33|38|42|37|39|43|42|41|40 West Frisian
 |26|30|26|25|24|31|26|00|29|41|32|32|40|37|43|45|39|36|43 Saterland Frisian
 |32|33|25|23|21|30|27|29|00|41|36|37|37|36|38|37|37|34|39 North Frisian
 |40|40|35|34|36|33|39|41|41|00|27|40|45|42|50|51|45|45|51 Luxembourgish
 |33|36|29|31|24|28|33|32|36|27|00|30|39|34|42|40|38|37|45 German
 |34|39|33|34|30|34|38|32|37|40|30|00|38|34|36|36|36|36|42 Yiddish
 |39|42|38|34|38|37|42|40|37|45|39|38|00|19|24|27|04|18|42 Danish
 |37|37|33|31|34|34|37|37|36|42|34|34|19|00|28|28|20|19|39 Swedish
 |38|42|37|36|37|40|39|43|38|50|42|36|24|28|00|10|21|20|44 Faroese
 |42|44|37|37|37|39|43|45|37|51|40|36|27|28|10|00|25|20|43 Icelandic
 |38|41|39|35|37|38|42|39|37|45|38|36|04|20|21|25|00|15|41 Norwegian (bokmål)
 |36|42|37|33|32|38|41|36|34|45|37|36|18|19|20|20|15|00|38 Norwegian (nynorsk)
 |37|36|41|40|40|44|40|43|39|51|45|42|42|39|44|43|41|38|00 Sranan
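A matrix like this can be assembled by summing the per-word edit distances for every pair of languages. The Python sketch below shows the idea; the three-word lists are a hypothetical illustration and are not the 13 words used for the table.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein edit distance between two words."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


# Hypothetical mini word lists: the same concepts (fish, water, hand)
# in the same order for every language, lowercased.
WORDS = {
    "English": ["fish", "water", "hand"],
    "German": ["fisch", "wasser", "hand"],
    "Dutch": ["vis", "water", "hand"],
}


def summed_distance(list_a, list_b):
    """Add up the edit distances of the words for each concept."""
    return sum(levenshtein(x, y) for x, y in zip(list_a, list_b))


def distance_matrix(words):
    """Return a nested dict: matrix[a][b] is the summed distance from a to b."""
    names = list(words)
    return {a: {b: summed_distance(words[a], words[b]) for b in names}
            for a in names}


matrix = distance_matrix(WORDS)
for name, row in matrix.items():
    print(" |" + "|".join(f"{value:02d}" for value in row.values()), name)
```

With longer word lists, the raw sums grow with the list size, so results from lists of different lengths are not directly comparable without some normalisation.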

Just to compare: for Swedish, I get 19 (Tyshchenko got 21) to Danish, 28 (26) to Icelandic and 20 (16) to Norwegian (bokmål). For German, I get 33 (49) to English, 29 (25) to Dutch and 39 (41) to Danish. My method and list do not match Tyshchenko’s, but at times I am pretty close.

Levenshtein Edit Distance is blunt: changing p → b counts as the same Lexical Distance as p → k, and changing a → i counts the same as a → o. To fix this for a written language comparison, one could assign an LD value to every change. For example, p → b would be 0.5 LD, d → th would be 0.6 LD and p → k would be 1.2 LD. Vincent has done something similar at elinguistics.net.
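A weighted variant could look roughly like the Python sketch below. It uses the p → b and p → k values suggested above; the default cost of 1.0 for every other change, and for insertions and deletions, is my assumption. Changes involving more than one letter, such as d → th, would need the words to be split into units first and are not handled here.

```python
# Costs for specific letter replacements, taken from the examples in the text.
# A frozenset is used so that p -> b and b -> p share the same cost.
SUBSTITUTION_COST = {
    frozenset("pb"): 0.5,
    frozenset("pk"): 1.2,
}


def substitution_cost(x: str, y: str) -> float:
    """Cost of replacing letter x with letter y (assumed 1.0 if unlisted)."""
    if x == y:
        return 0.0
    return SUBSTITUTION_COST.get(frozenset((x, y)), 1.0)


def weighted_levenshtein(a: str, b: str, indel: float = 1.0) -> float:
    """Edit distance where each replacement has its own cost and every
    insertion or deletion costs `indel`."""
    previous = [j * indel for j in range(len(b) + 1)]
    for i, ch_a in enumerate(a, start=1):
        current = [i * indel]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + indel,                             # deletion
                current[j - 1] + indel,                          # insertion
                previous[j - 1] + substitution_cost(ch_a, ch_b)  # replacement
            ))
        previous = current
    return previous[-1]


print(weighted_levenshtein("pat", "bat"))  # uses the p -> b cost
print(weighted_levenshtein("pat", "kat"))  # uses the p -> k cost
```

Any replacement cost above 2.0 would never be used with these settings, because deleting one letter and inserting another would be cheaper.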

Even more elegant would be a comparison that combines grammar, syntax, rhythm and pronunciation, the last of these by comparing words written in IPA. Sadly, few word lists have IPA versions. It remains a beautiful idea, and I am working on defining a method and compiling data for a version that might be satisfactory for Arsenijevic.



  1. Thanks a lot, it’s very clearly explained here, and it clarifies a lot. I have nothing to add, except that there are too many potential confounds for this type of analysis to expect reliable results.

