Lexical Distance, a hoax?

Recently Boban Arsenijevic tore the Lexical Distance Diagram apart and made some fair critiques of it in the blog post “Lexical Distance Among the Languages of Europe: a hoax, a political pamphlet or actual popular science?”

Finishing off with:

I think that it is important that according to my little investigation – unless someone manages to find the specification of its methodology – which I doubt will happen, and this methodology proves adequate – the chart should be treated as a hoax with a political agenda. But its idea remains beautiful, and I would like to see a chart of the same kind, plotted from the real linguistic data.

Similar criticism has been voiced elsewhere. I will attempt to sway Arsenijevic’s opinion and clarify some things.

I am sure that Prof. Tyshchenko’s research was an enormous amount of work, considering that most, if not all, of it was done manually in the 70s and 80s without electronic dictionaries as a resource. I have attempted to reach out to Prof. Tyshchenko to confirm the methodology, regrettably with no success.

Let’s be clear about what this shows and what it does not show. It compares written language and not pronounced language. It has nothing to do with grammar, syntax, rhythm or other features that are important for intelligibility. It also compares a small list of words and not the entire vocabulary of one language to another. Many letters are pronounced differently in different languages (e.g. English V has the same pronunciation as German W, and English Y as German J). There are words pronounced the same in two different languages that are spelt differently (Arsenijevic’s example Russian раздел and Ukrainian роздiл, or English fish to German Fisch). There are words that retained letters in their spelling which are now silent in one language but pronounced in another (e.g. English knife to Swedish kniv, or English knot to German Knoten). There are even false silent letters added to suggest a closer linguistic connection (e.g. the S in English island to the Latin insula), and the spelling of one language has sometimes had a disproportionate influence on another (e.g. Bulgarian to Russian or Latin to many of the Romance languages). All of these spelling peculiarities skew the lexical distance between languages away from the pronunciation distance between languages.

I would argue it is possible and interesting to compare dictionaries and the spelling in different languages and then come up with a “lexical” distance diagram. If we were to compare the pronunciation of lists of words to each other, for example by determining an IPA form for each word, then we would be able to produce a “pronunciation” distance diagram! That would be more interesting than the lexical diagram but no one has ever done the work for that yet, and thus we are stuck with what we have.

A common misunderstanding of the Lexical Distance Diagram is that a line between two languages means they are more closely related to each other than to languages they have no line to. A missing line between two languages does not mean that there is no link between them; it just means that the lexical distance between these two languages has not been researched yet. Also, because the connections were compiled first and the languages were then placed according to their researched connections, some languages are placed in a different location than they would be if more connections had been determined (for example, Occitan would be closer to Catalan).

My best guess at the Methodology.

Prof. Tyshchenko probably manually determined each Lexical Distance between language pairs with Levenshtein Edit Distance using a Dolgopolsky № 15 list or a Swadesh № 100 or № 207 list. For example, comparing Bulgarian жена (woman) to Russian женщина, the deletion of щ, и, н would give you 3 LD; Bulgarian жена to Ukrainian жінка, with е → і and the insertion of к, would give you 2 LD. Bulgarian година (year) to Russian год is 3 LD; Bulgarian година to Ukrainian рік is 6 LD. Maybe Tyshchenko compared about 15 words and then added up the LDs, or compared 207 words but did not double-count letter replacements, insertions and deletions.
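The per-word counts in the paragraph above can be checked with a few lines of Python. This is only a minimal sketch of plain Levenshtein distance on spelling, with every insertion, deletion and replacement costing 1; whether Tyshchenko counted edits this way is exactly the guess being made here, and the function name is my own.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-letter insertions, deletions and
    replacements needed to turn string a into string b."""
    # prev[j] is the distance between the already processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace ca with cb, or keep it
        prev = curr
    return prev[-1]


# The word pairs quoted in the paragraph above.
pairs = [("жена", "женщина"), ("жена", "жінка"),
         ("година", "год"), ("година", "рік")]
for left, right in pairs:
    print(left, right, levenshtein(left, right))
```

Run on the four pairs, it gives 3, 2, 3 and 6, matching the hand counts.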

I believe my guess is not too far off. If I cannot confirm it with Tyshchenko, then maybe I can recreate it. I calculated the Levenshtein Distance for a set of Germanic languages using 13 words. This is the result; the columns follow the same language order as the rows.

 |00|13|29|30|30|28|26|26|32|40|33|34|39|37|38|42|38|36|37 English
 |13|00|33|31|34|31|26|30|33|40|36|39|42|37|42|44|41|42|36 Scots
 |29|33|00|07|15|15|28|26|25|35|29|33|38|33|37|37|39|37|41 Dutch
 |30|31|07|00|17|19|30|25|23|34|31|34|34|31|36|37|35|33|40 Afrikaans
 |30|34|15|17|00|22|30|24|21|36|24|30|38|34|37|37|37|32|40 Low Saxon
 |28|31|15|19|22|00|28|31|30|33|28|34|37|34|40|39|38|38|44 Limburgs
 |26|26|28|30|30|28|00|26|27|39|33|38|42|37|39|43|42|41|40 West Frisian
 |26|30|26|25|24|31|26|00|29|41|32|32|40|37|43|45|39|36|43 Saterland Frisian
 |32|33|25|23|21|30|27|29|00|41|36|37|37|36|38|37|37|34|39 North Frisian
 |40|40|35|34|36|33|39|41|41|00|27|40|45|42|50|51|45|45|51 Luxembourgish
 |33|36|29|31|24|28|33|32|36|27|00|30|39|34|42|40|38|37|45 German
 |34|39|33|34|30|34|38|32|37|40|30|00|38|34|36|36|36|36|42 Yiddish
 |39|42|38|34|38|37|42|40|37|45|39|38|00|19|24|27|04|18|42 Danish
 |37|37|33|31|34|34|37|37|36|42|34|34|19|00|28|28|20|19|39 Swedish
 |38|42|37|36|37|40|39|43|38|50|42|36|24|28|00|10|21|20|44 Faroese
 |42|44|37|37|37|39|43|45|37|51|40|36|27|28|10|00|25|20|43 Icelandic
 |38|41|39|35|37|38|42|39|37|45|38|36|04|20|21|25|00|15|41 Norwegian (bokmål)
 |36|42|37|33|32|38|41|36|34|45|37|36|18|19|20|20|15|00|38 Norwegian (nynorsk)
 |37|36|41|40|40|44|40|43|39|51|45|42|42|39|44|43|41|38|00 Sranan

Just to compare: for Swedish, I get 19 (Tyshchenko got 21) to Danish, 28 (26) to Icelandic and 20 (16) to Norwegian (bokmål). For German, I get 33 (49) to English, 29 (25) to Dutch and 39 (41) to Danish. My method and word list do not match Tyshchenko’s, but at times I am pretty close.
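A matrix of this kind can be built by summing the per-word edit distances over a shared word list for every pair of languages. The sketch below shows the idea under loud assumptions: the three-concept list (water, fish, hand) is only an illustration and is not the 13-word list used for the table, and lowercasing the words before comparison is my own choice.

```python
from itertools import combinations


def levenshtein(a, b):
    """Plain edit distance: each insertion, deletion or replacement costs 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


# Illustrative list only (water, fish, hand); NOT the 13 words behind the table.
WORDS = {
    "English": ["water", "fish", "hand"],
    "German": ["Wasser", "Fisch", "Hand"],
    "Dutch": ["water", "vis", "hand"],
    "Swedish": ["vatten", "fisk", "hand"],
}


def distance_matrix(words):
    """Sum the per-word distances for every pair of languages."""
    langs = list(words)
    matrix = {a: {b: 0 for b in langs} for a in langs}
    for a, b in combinations(langs, 2):
        total = sum(levenshtein(x.lower(), y.lower())  # ignore capitalisation
                    for x, y in zip(words[a], words[b]))
        matrix[a][b] = matrix[b][a] = total
    return matrix


MATRIX = distance_matrix(WORDS)
for lang, row in MATRIX.items():
    # Same layout as the table: one row of two-digit sums, then the language.
    print(" |" + "|".join(f"{row[other]:02d}" for other in MATRIX), lang)
```

Because the totals are plain sums, they grow with the length of the word list, so numbers from lists of different sizes are not directly comparable.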

Levenshtein Edit Distance is blunt: changing p → b counts as the same Lexical Distance as p → k, and changing a → i counts the same as a → o. To fix this for a written-language comparison, one could assign an LD value to every change. For example, p → b would be 0.5 LD, d → th would be 0.6 LD and p → k would be 1.2 LD. Vincent has done something similar at elinguistics.net.
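A weighted variant of this kind could look roughly like the sketch below. The 0.5 and 1.2 costs come from the example values above; everything else is an assumption: the default replacement cost of 1.0, the insertion and deletion cost of 1.0, and treating p → b the same as b → p. A change such as d → th replaces one letter with two and would need multi-letter handling, which this sketch leaves out. The test strings are toy strings, not real word-list entries.

```python
# Replacement costs taken from the example values in the text.
# Assumption: a cost applies in both directions (p -> b equals b -> p).
REPLACEMENT_COST = {
    frozenset("pb"): 0.5,
    frozenset("pk"): 1.2,
}
DEFAULT_REPLACEMENT = 1.0  # assumption: any unlisted letter pair
INDEL_COST = 1.0           # assumption: inserting or deleting one letter


def replacement_cost(x, y):
    """Cost of replacing letter x with letter y."""
    if x == y:
        return 0.0
    return REPLACEMENT_COST.get(frozenset((x, y)), DEFAULT_REPLACEMENT)


def weighted_distance(a, b):
    """Edit distance in which each replacement carries its own cost."""
    prev = [j * INDEL_COST for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * INDEL_COST]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + INDEL_COST,       # delete ca
                            curr[j - 1] + INDEL_COST,   # insert cb
                            prev[j - 1] + replacement_cost(ca, cb)))
        prev = curr
    return prev[-1]


# Toy strings: p -> b is now cheaper than p -> k.
print(weighted_distance("pat", "bat"), weighted_distance("pat", "kat"))
```

One design point to watch: a replacement costing more than 2.0 would never be used, because deleting one letter and inserting another is cheaper.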

Even more elegant would be comparing a mix of grammar, syntax, rhythm and pronounced language, with pronunciation compared via words written in IPA. Sadly, few word lists have IPA versions. It remains a beautiful idea, and I am working to define a method and compile data for a version that might be satisfactory for Arsenijevic.


  1. Thanks for your attempt at edification. I just saw this diagram on Teresa Elms’ blog, directed there from a discussion on Quora. I am a statistician by trade, linguistics is my hobby. The concept of the graph I find very attractive, but I immediately wondered about the methodology, and particularly about the display.

    The definition of distance itself is problematic, and you discuss the issues above. But another aspect is: How do you display the distances graphically? The graph links circles with line segments which are coded according to similarity. But it is not at all clear how well the geometric placement reflects these distances. In particular, what does the lack of an edge between two circles mean? You suggest it just means “insufficient data” which is not at all what the visual effect of the graph implies.

    There are statistical techniques, such as multidimensional scaling or principal components that allow one to present, in two dimensions, a reasonable summary of the distance matrix. From what I hear, nothing so sophisticated was used to come up with this diagram. Are linguists conversant with these statistical methods?


    • I agree that showing this data in a graph like this is an attractive but flawed method. I think that Elms’ version is, to some degree, better than my version. Her graph is the equivalent of showing a subway system as a diagram in Vignelli’s style, giving the viewer a clear mental image of connections and relationships, whereas my version is messier, less clear-cut and the equivalent of this 1948 transit map. But both versions are attempts to show a matrix of data with a multitude of dimensions in only 2 dimensions. Subway systems are much simpler in the number of dimensions they traverse. I did try to explain this dilemma in the blog post Dimensional limits of Graphs. You mentioned multidimensional scaling, which I have attempted with Marcin and Vincent; see Visualizing Lexical Distance in Three Dimensions. My 2015 version used nothing so sophisticated.

      However, I feel the visualization in three dimensions is again not as attractive to the casual viewer as the earlier 2D versions.

      I think either it should be made into an interactive version, similar to this work from Daniel Probst, or it should be mapped onto a sphere: since the MDS warps these data nodes in three dimensions into a rough sphere shape, I am looking for an algorithm that would scale the connections so that the nodes are distributed over the surface of a sphere in 3D. Thus the connections between the individual languages would be scaled up or down, but the distance from each language node to the center of the sphere would always be the same. This would allow me to manually draw an “attractive” global map of the languages with calculated language node placements.

      My question to you is, do you know of a method of multidimensional scaling limited to the surface of a sphere?
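      To make the constraint concrete, here is a crude sketch and not the algorithm asked for: it takes 3D coordinates (hypothetical stand-ins for MDS output), pushes every node radially onto the unit sphere, and measures the resulting great-circle distances. A real spherical MDS would choose the positions to minimise distortion; radial projection simply accepts whatever distortion results.

```python
import math


def project_to_unit_sphere(point):
    """Rescale a 3D point so that it sits at distance 1 from the origin."""
    x, y, z = point
    radius = math.sqrt(x * x + y * y + z * z)
    if radius == 0.0:
        raise ValueError("the origin has no direction to project along")
    return (x / radius, y / radius, z / radius)


def great_circle_distance(p, q):
    """Angle in radians between two points on the unit sphere."""
    dot = sum(a * b for a, b in zip(p, q))
    return math.acos(max(-1.0, min(1.0, dot)))  # clamp against rounding error


# Hypothetical 3D coordinates; real ones would come from the distance matrix.
mds_coords = {"A": (2.0, 0.0, 0.0), "B": (0.0, 0.5, 0.0), "C": (1.0, 1.0, 0.0)}
on_sphere = {name: project_to_unit_sphere(p) for name, p in mds_coords.items()}

for a, b in [("A", "B"), ("A", "C"), ("B", "C")]:
    print(a, b, round(great_circle_distance(on_sphere[a], on_sphere[b]), 3))
```

      After projection every node is the same distance from the center, and only the angles between nodes carry information.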


      • Thanks for the references. I don’t specifically know of an MDS method for spherical data, but it is an attractive idea. If you look at the set of correlations between objects (i.e. languages), then the inverse cosines of these correlations would correspond to angles (i.e., great-circle distances) between points on the sphere. So in general, n languages could be mapped, without information loss, onto an n-2 dimensional sphere. What you would be looking for is a projection of such a representation onto a 2-sphere with minimal information loss. Depending on how you define information loss, my gut feeling is that the representation would be the same or similar to what you would get from the first three principal components of the correlation matrix. There remains the rather complicated question of how one would define a “correlation” between languages. An idea that jumps into my head is using the probability of mutual comprehension. How likely is a native speaker of one language to understand a random utterance in another language? This could be determined empirically by testing a sample of speakers of various languages. Could one treat these probabilities as correlations? Correlations can be positive or negative. Does it make sense for two languages to have a negative correlation? Would that mean that one is more likely to misunderstand the foreign utterance (too many false friends)? On the other hand, if we consider zero correlation as the farthest distance, then the graphical representation would be restricted to the first octant, which is just a stretched triangle.

        I am just thinking as I type — some thoughts you might consider.


  2. To a linguist (such as myself) there is no such thing as a ‘written language’; writing is a technology used to represent some aspect of language (i.e. spoken or signed). And there has been a huge amount of work comparing lists of words, usually written phonemically (which is different to being written in IPA), to figure out their relationships. This is called lexicostatistics (https://en.wikipedia.org/wiki/Lexicostatistics) and is commonly used to carry out the initial work of figuring out a family tree structure for a group of related languages. A huge problem in carrying out this work is borrowings, which can obscure actual genetic relationships between languages.

    As far as I can make out, Tyshchenko’s lexical distance calculations might give some insight into how much mutual comprehension of vocabulary there would be between speakers of pairs of languages, but of course morphology and syntax would have a big effect on that.


  3. Thanks a lot, it’s very clearly explained here, and it clarifies a lot. I have nothing to add, except that there are too many potential confounds for this type of analysis to expect reliable results.


    • Hi everyone!

      Can you tell me why there is a significant difference in the distances and positions of the Slavic languages between these 2 pictures:

      Picture 1
      Picture 2

      Look, for example, at the positions of Slovak and Bulgarian and their distances to the other Slavic languages.

      AlternativeTransport Edit: Linked Pictures and combined text to other question.

