Lexical Distance Among Languages of Europe 2015

I found the lexical distance map fascinating, but the closer I studied it, the more things bothered me. Trawling through the net, I found Tyshchenko's Cyrillic versions. So I sat down this weekend and translated, adjusted, combined and updated them to create my own version here:
Lexical Distance Among the Languages of Europe 2.1 mid-size



Edit 2017.03.11: Is your language/dialect missing? Help include it in a future version! Add a Swadesh word list at Wiktionary.org and check whether Wikipedia.org gives a number of speakers.


Here is a list of my changes.


First, the abbreviations. After some research I found out that Pro, the abbreviation of one Romance language, stands for Provençal, and I assume Ga stands for Scottish Gaelic. Tyshchenko’s abbreviations are in Cyrillic; Elms’s are translated into Latin script. Some of the Latin-script abbreviations correspond with ISO 639-3, some do not. I changed them all to the ISO 639-3 standard.

Second, the legend shows bubbles that represent the number of speakers of each language. The bubble area corresponds with the number of speakers only in logarithmic classes: >3 000 speakers, >30 000 speakers, >300 000 speakers, etc. That means that the Ukrainian bubble (37 million speakers) is the same size as the German or Russian bubble (95 million speakers in Europe), and that the Icelandic bubble (300 000 speakers) is the same size as the bubbles of languages with 2.5 to 3 million speakers. I recalculated the diameter of each bubble and scaled it to the number of speakers of that language in Europe (source).
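The difference between the two scalings can be sketched in a few lines of Python. This is only an illustration, not the exact procedure used for the map: the factor of one diameter unit per 20 million speakers is an assumption read off the table further down (95 000 000 speakers give a diameter of 4.75), and the function names are made up for this example. The first function reproduces the logarithmic classes of the original legend, the second the linear rescaling.

```python
from decimal import Decimal, ROUND_HALF_UP

# Assumption: one diameter unit per 20 million speakers, inferred from the
# table in this post (95 000 000 speakers -> diameter 4.75).
SPEAKERS_PER_DIAMETER_UNIT = Decimal(20_000_000)


def legend_class(speakers: int) -> int:
    """Lower bound of the logarithmic class (3 000, 30 000, ...) of a language.

    Returns 0 for languages below the smallest class.
    """
    if speakers < 3_000:
        return 0
    bound = 3_000
    while bound * 10 <= speakers:
        bound *= 10
    return bound


def bubble_diameter(speakers: int) -> Decimal:
    """Diameter proportional to the number of speakers, rounded to two decimals."""
    raw = Decimal(speakers) / SPEAKERS_PER_DIAMETER_UNIT
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


for name, speakers in [("Ukrainian", 37_000_000), ("German", 95_000_000)]:
    print(name, legend_class(speakers), bubble_diameter(speakers))
```

With the classes, Ukrainian and German both land in the 30 000 000 class and get the same bubble; with the linear scale they get diameters of 1.85 and 4.75.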

Third, the Elms diagram depicts several Indo-European branches (Albanian, Baltic, Celtic, Germanic, Hellenic, Romance (Italic) and Slavic) as well as some Uralic languages. Missing are the Indo-European Armenian branch (where does Europe end?), all Turkic languages, all Ibero-Caucasian and Kartvelian languages, the sole European Semitic language, Maltese, and finally Basque. Taking the 66 150 Faroese speakers as a cut-off line, there are 1 Armenian, 1 Basque, 2 Germanic, 6 Italic, 2 Kartvelian, 5 Caucasian, 1 Semitic, 8 Turkic, 2 Slavic and 14 Uralic languages missing (again depending on where Europe ends and what is considered a language). Adding the Kartvelian, Turkic, Indo-Iranian, Ibero-Caucasian and Uralic languages seemed too daunting a task. I did add many of the other missing languages.

Additions: Basque, Semitic, Indo-Iranian and Armenian
Adding Basque and Maltese was not that much of a hassle. Basque is at a lexical distance of 70 to the left of Spanish and 95 from Berber (which is spoken in North Africa and is not included). Maltese is 70 down from Italian and at an undefined distance from Greek. The Indo-Iranian Romani language is widespread and would have enough speakers to be included, but it diverges within itself and is so widespread that it is hard to determine a lexical distance. To get a lexical connection to Armenian I would have to add almost all the other missing languages; my apologies for not attempting that.

Germanic Adjustments
Since Scots has no official status or clear boundary I did not include it. Luxembourgish, on the other hand, does and would be close to German and Dutch with a link to French; I placed it where I assume it would be, but with no distances marked. Frisian is presently defined as a language group: Northern Frisian, Eastern Frisian and Western Frisian. Sadly, use of the northern and eastern languages has diminished, and they have 10 000 and 2 250 speakers respectively. West Frisian, with roughly 467 000 speakers, is in good health. I could not figure out which Frisian Tyshchenko was referring to, but I assume West Frisian and labelled the bubble accordingly.
Norwegian has two official written forms, Bokmål and Nynorsk. I considered combining the two but decided against it. Tyshchenko assessed both separately to determine their divergence. That makes sense: he may not speak Norwegian or many of the other languages researched, so he had to rely on comparing syntax, vocabulary, morphology, etc. to determine lexical distance, and he did not have the resources to survey the Norwegian language to determine a standard Norwegian (there is none). So falling back on those two written forms is the best he could do. It also beautifully displays the relationship between Nynorsk, Bokmål and other languages. Bokmål is closer to Danish than to Nynorsk; Nynorsk is closer to Icelandic than any other mainland language is.

Romance Adjustments
Back to Provençal. Elms translated провансальська as Provençal, which is probably the correct translation of the Ukrainian word. Provençal is considered an Occitan dialect or a language of its own, depending on whom you ask. I am going to assume that Tyshchenko was assessing the lexical distance of the Occitan language, and I re-labelled Pro accordingly. Or did Tyshchenko mean Franco-Provençal? Probably not: the line to Spanish is stronger than the line to French. I assume Franco-Provençal is missing and that a bubble labelled Frp should be placed close to Oci with links to French and Italian. Walloon (Wln) has archaisms coming from Latin and significant borrowings from the Germanic languages Dutch, Luxembourgish and German. Picard has no official status in France but does in Belgium, and it straddles the border between the two nations in Nord-Pas-de-Calais and Picardy. Asturian is recognized as typologically and phylogenetically close to Galician-Portuguese and Castilian, and less so to Navarro-Aragonese (Castilian and Aragonese do not make the cut-off line and are not included). The greatest number of speakers of Aromanian is found in Greek Macedonia, with substantial numbers also in Albania, Bulgaria, Serbia and the FYR of Macedonia, which also officially recognizes it. Compared to Romanian, the Eastern Romance language Aromanian (Rup) has been influenced more by Greek than by Slavic. I placed Rup close to Ron in the direction of Grk.

Slavic adjustments and updates

It has been a while since the Croats and Serbs decided that they do not speak the same language, and this is accurately depicted above, but the Bosnians and Montenegrins have also decided that they have their own languages. Thus I added a Bos bubble and a Mis1 bubble (for the missing ISO code of Montenegrin) right next to the Hrv and Srp bubbles. In Elms’s translation there is one bubble named Sr between the Czech and Polish bubbles; in Tyshchenko’s 1999 diagram there are two bubbles there. I assume the larger one is Silesian and the smaller one Sorbian; I added both, even though Sorbian does not make the cut-off line.

That leaves me with 54 languages representing 670 million people; Europe has an estimated population of 740 million. It checks out.

ISO 639-3  Language           Branch or family  Speakers in Europe  Bubble diameter

deu        German             Germanic          95 000 000          4.75
rus        Russian            Slavic            95 000 000          4.75
fra        French             Italic-Romance    60 000 000          3.00
ita        Italian            Italic-Romance    57 700 000          2.89
eng        English            Germanic          55 600 000          2.78
spa        Spanish            Italic-Romance    45 000 000          2.25
pol        Polish             Slavic            38 663 000          1.93
ukr        Ukrainian          Slavic            37 000 000          1.85
ron        Romanian           Italic-Romance    23 782 000          1.19
nld        Dutch              Germanic          21 944 000          1.10
grk        Greek              Hellenic          13 420 000          0.67
hun        Hungarian          Uralic            12 606 000          0.63
ces        Czech              Slavic            10 619 000          0.53
cat        Catalan            Italic-Romance    10 000 000          0.50
por        Portuguese         Italic-Romance    10 000 000          0.50
swe        Swedish            Germanic          9 197 090           0.46
srp        Serbian            Slavic            8 957 906           0.45
bul        Bulgarian          Slavic            8 157 770           0.41
sqi        Albanian           Albanian          7 400 000           0.37
hrv        Croatian           Slavic            5 752 090           0.29
dan        Danish             Germanic          5 522 490           0.28
fin        Finnish            Uralic            5 392 180           0.27
slk        Slovak             Slavic            5 187 740           0.26
nob        Norwegian Bokmål   Germanic          3 854 000           0.19
bel        Belarusian         Slavic            3 312 610           0.17
lit        Lithuanian         Baltic            3 001 860           0.15
glg        Galician           Italic-Romance    2 355 000           0.12
bos        Bosnian            Slavic            2 225 290           0.11
slv        Slovene            Slavic            2 085 000           0.10
lav        Latvian            Baltic            1 752 260           0.09
mkd        Macedonian         Slavic            1 407 810           0.07
srd        Sardinian          Italic-Romance    1 200 000           0.06
est        Estonian           Uralic            1 165 400           0.06
nno        Norwegian Nynorsk  Germanic          846 000             0.04
wln        Walloon            Italic-Romance    600 000             0.03
eus        Basque             Basque            545 872             0.03
cym        Welsh              Celtic            536 890             0.03
mlt        Maltese            Semitic           522 000             0.03
szl        Silesian           Slavic            510 000             0.03
mis1       Montenegrin        Slavic            510 000             0.03
fry        Western Frisian    Germanic          467 000             0.02
ltz        Luxembourgish      Germanic          336 710             0.02
isl        Icelandic          Germanic          300 000             0.02
gle        Irish              Celtic            276 310             0.01
oci        Occitan            Italic-Romance    220 000             0.01
bre        Breton             Celtic            206 000             0.01
pcd        Picard             Italic-Romance    200 000             0.01
frp        Franco-Provençal   Italic-Romance    140 000             0.01
rup        Aromanian          Italic-Romance    114 340             0.01
ast        Asturian           Italic-Romance    110 000             0.01
gla        Scottish Gaelic    Celtic            68 130              0.00
fao        Faroese            Germanic          66 150              0.00
lat        Latin              Italic-Romance    30 000              0.00
wen        Sorbian            Slavic            30 000              0.00

Note: This is just Europe, so if you add Spanish, French, Portuguese and English from elsewhere this table would look different.
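For anyone who wants to re-check the totals, here is a rough Python sketch that reads rows of the table above and adds up the speakers. It is not part of how the map was made; the regular expression and the name parse_row are my own assumptions for this example, and it relies on every branch name being a single word (as in "Italic-Romance").

```python
import re

# Assumed row layout: code, language (may contain spaces), branch (one word),
# speakers grouped in thousands with spaces, bubble diameter.
ROW = re.compile(
    r"^(?P<code>\w+)\s+(?P<label>.+?)\s+"
    r"(?P<speakers>\d{1,3}(?: \d{3})*)\s+(?P<diameter>\d+\.\d+)$"
)


def parse_row(line: str) -> dict:
    """Split one table row into its fields."""
    match = ROW.match(line.strip())
    if match is None:
        raise ValueError(f"not a table row: {line!r}")
    # The branch is the last word of the label, the language is everything before it.
    language, branch = match["label"].rsplit(None, 1)
    return {
        "code": match["code"],
        "language": language,
        "branch": branch,
        "speakers": int(match["speakers"].replace(" ", "")),
        "diameter": float(match["diameter"]),
    }


sample = [
    "deu German Germanic 95 000 000 4.75",
    "nob Norwegian Bokmål Germanic 3 854 000 0.19",
    "gla Scottish Gaelic Celtic 68 130 0.00",
]
rows = [parse_row(line) for line in sample]
total = sum(row["speakers"] for row in rows)
print(len(rows), "languages,", total, "speakers")
```

Fed all 54 rows instead of the three sample rows, the same sum should come out at roughly the 670 million quoted above.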

Fourth. I added a list of abbreviations and redid the distance scale and speaker categories.

Fifth. Tyshchenko gave the language branches circular labels and, in the version that includes Iranic, also drew circles around the branches. In another version the spaces between the connection lines within a branch are coloured in. This all reminded me of an Euler diagram, which also shows the relationships between the branches; in particular, the Celtic, Germanic and Romance circles overlap. I wanted to include this in my version, so I gave each branch and each language family its own bubble. For some I faded the edges to symbolise that the boundaries of the branch merge into other branches.

Sixth. I added two gravestones, for Anatolian and Tocharian.

Seventh. I added arrows to other languages outside of Europe.


Finally, a note on the lines that link the different language bubbles. If you look at the Germanic branch, you will notice that there are links between English and every other Germanic language except Swedish. The same can be observed for the larger languages in the Romance and Slavic branches. A missing line between two languages does not mean that there is no link between them; it just means that the lexical distance between these two languages has not been researched yet. Thus, for example, the link between Albanian and Serbian or between German and French is real but not shown.
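If you wanted to hold the links as data, the point about missing lines could be sketched like this in Python. The only distances in it are the two quoted in this post (Basque to Spanish 70, Maltese to Italian 70); the dictionary layout and the function names are hypothetical, and treating a language as being at distance 0 from itself is my own assumption. The key idea is that an unresearched pair returns None rather than some very large number.

```python
from typing import Optional

# Distances quoted in this post. Pairs that are not listed have simply
# not been researched; they are NOT infinitely far apart.
RESEARCHED = {
    frozenset({"eus", "spa"}): 70,
    frozenset({"mlt", "ita"}): 70,
}


def lexical_distance(a: str, b: str) -> Optional[int]:
    """Researched distance between two languages, or None if not researched."""
    if a == b:
        return 0  # assumption: a language is at distance 0 from itself
    return RESEARCHED.get(frozenset({a, b}))


def describe(a: str, b: str) -> str:
    """Human-readable status of the link between two languages."""
    distance = lexical_distance(a, b)
    if distance is None:
        return f"{a}-{b}: not researched, so no line is drawn"
    return f"{a}-{b}: lexical distance {distance}"


print(describe("spa", "eus"))
print(describe("deu", "fra"))
```

Using frozenset keys makes the lookup symmetric, so asking for Spanish to Basque or Basque to Spanish gives the same answer.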

Update 17.05.2015
An earlier version of this page had Romansh (Roh) and Latvian mislabelled, and was missing Friulian, with 300 000 speakers and the ISO 639-3 code Fur.


54 comments

  1. Excellent job. I would like to know what the numbers between the discontinuous lines mean. Thank you.


    • In the Celtic group, Cornish (in South-West England) and Breton (in North-West France) are very closely related (it is frequently reported that sailors from both regions spoke to each other easily, with little adaptation, and did not even need to adapt their language to be fully understood; in fact there are as few differences between Cornish and Breton as between the 4 major variants of Breton itself, which also has several competing orthographic standards depending on these variants and the need or absence of need to distinguish local accents).

      You could also add Manx to this group (also strongly linked to both Cornish and Irish Gaelic, and somewhat to Scottish Gaelic).

      In the French group, Norman is considered a dialect of French (though historically it was not: French was born much later). The version of Norman that came to England after the Norman conquest explains the links that remain between English and French, but Norman also includes Jersiais and Guernésiais, which are usually not considered part of French, even if they are strongly related to continental Norman in France. Jersiais at least now has an official status in Jersey, and a locally approved dictionary that ignores other forms used in Norman on the continent. In Wikipedia (despite the fact that it has been using, and is still using, a non-standard language code “nrm” that is in fact assigned to an unrelated language, even though Norman now has a standard code “nrf”, assigned at least for the continental form), Norman, Jersiais and Guernésiais are not distinguished (Jersiais is still pending a request for a separate ISO 639-3 code instead of “nrf”; Guernésiais could follow as well).


  2. Portuguese should be linked with Galician, since it derives from Galician. In the Middle Ages Galicia split in two, giving rise to Portugal. Both spoke the same language, since until then they were all Galicians. Over the centuries the two languages evolved, Galician under the influence of Castilian, but even today Galician and Portuguese are practically the same, like the Spanish of Spain and the Spanish of Argentina.


  3. I dearly want this in large size printed for my linguist wife. Any chance one of the vector formats could be made available? Or that you could start selling prints?


    • Hi Larsrc,
      If you want to gift it to a linguist then be careful, there are inaccuracies and unclear methodology which seem to pain some linguists. I have sent a version with a higher resolution to someone for a presentation, however not in a vector format.

      I did a quick check of what this would cost as a poster. A 48″ x 36″ poster on photo paper with shipping via USPS would cost about 154 USD. A 120 cm x 90 cm print on photo paper with shipping within the EU would cost about 96 EUR. For canvas, Sardisverlag sells maps printed on canvas from 60 EUR for an unmodified rolled A1 print (59 cm x 84 cm) up to 300 EUR for an individual version of an A0 print (84 cm x 119 cm) on a stretcher frame. I guess they got a better price by ordering the prints they are now selling in bulk. If you dearly want it, I can look into printing and sending a copy.


  4. I think you should consider placing Catalan and Occitan closer together, or at the bare minimum connect them with a line. There is a great degree of mutual intelligibility between the two, particularly with regard to the more southerly dialects of Occitan.

    See this post on Quora, for example: https://www.quora.com/Are-Occitan-Provençal-and-Catalan-mutually-intelligible

    On Reddit the question is whether the two are the same language! https://www.reddit.com/r/linguistics/comments/43njkg/are_catalan_and_occitan_the_same_language/


    • Historically (in the Middle Ages) Occitan and Catalan were the same language (so “Old Catalan” and “Old Occitan” refer to the same language).
      But they evolved separately into their modern forms, and are now clearly separated. They have common roots but distinct orthographies and pronunciations, and Catalan has borrowed more from Castilian Spanish (more or less adapted) than Occitan, which borrowed more forms from French (a more recently created language, which initially started with many imports from Occitan but ignored the Catalan influence, before becoming a national standard in France only at the beginning of the 20th century, when regional languages were still much more active than today: official French became really predominant in France only after World War I, which mixed a lot of people from all regions on the battlefields or through forced migrations of the civilian population across regions).

      Only at this time did Occitan and Catalan usage start to decline, and this accelerated only after the 1960s, but Occitan and Catalan remain separate forms in the same family of “Oc languages”. There is as much difference between Occitan and Catalan as between French and Castilian Spanish. Anyway, Occitan is not really a single unified language: it has significant dialectal differences between the variants spoken in Auvergne, Provence or Italy, and some variants in Roussillon also mix in some forms from Catalan.

      If you think that Occitan and Catalan are the same language, you are confusing “Occitan” (the modern regional language) with the “Oc languages” family (which includes all modern variants of Occitan (essentially in France, Italy and Switzerland), all modern variants of Catalan (in France, Spain and Andorra), and old Occitan-Catalan (whose separation started in the Middle Ages with the English occupation of Southwestern France, while Catalan forms were protected by former Catholic kingdoms in Spain before the “Reconquista” of Castile)).

      The historic invasions of France and Spain explain their difference: you cannot lump them together and consider modern Occitan and modern Catalan the same language, even if they belong to the same family and have a common root (Old Occitan = Old Catalan), which is also a clearly distinct language in that family.


      • I’m not claiming they are the same language, but rather I use the fact that a debate exists as evidence that the two are (still) very closely related – I would say considerably more so than the graphic indicates.

        Linguists generally agree that Catalan is Occitan’s closest relative and vice versa. I speak Catalan as a second language, and certain varieties of Occitan are quite intelligible to me. And yet, you wouldn’t think so by looking at the graphic, where Occitan is displayed as closer to Portuguese than to Catalan. This just doesn’t seem right to me.

        By the way, I believe that Catalan is not considered an oc language, as the word for “yes” is “sí”, not “oc”.


      • Yes, I agree Occitan and Catalan should be placed differently. The language nodes were placed according to the researched connections, and not all connections were researched. Thus smaller, less well-researched languages sometimes have unfortunate node placements.


  5. In your upcoming updated version, if you compare word spelling, you will run into trouble with isophonic letter combinations like Slavic č and ч, Polish cz, Hungarian cs, Romanian ci, etc. If you compare the pronunciation, you will sometimes lose the etymology like with Polish ó and Czech ů pronounced as /u/ while derived from “o” or Russian unstressed vowels that, variously written, are only pronounced as /a i u/. That said, I think pronunciation is the way to go. I can help you compile a uniform pronunciation list for Slavic languages. Just tell me which version of Swadesh list you use.


    • Same remark about Corsican (Italic-Romance, closer to Italian and Sardinian than to French, with many lexical items from the Genoese dialects of Italy, but now with increasing links to French, as it is the legal language in Corsica; Corsican still has some regional recognition and is not uncommon to find in various places in Corsica)


      • I don’t like the fact that these lists are collected on separate, unstructured pages of the English Wiktionary (behind obscure links that native speakers are unlikely to find or correct), when there is a universal way to do that, using the “Swadesh list” pages in each language’s own Wiktionary.
        Corsican has its own Wiktionary, and you should be able to use Wikidata entries (or the terms in the “translation” section of each defined term’s page on the native Wiktionary, and then the reverse links to match them).
        These Swadesh lists on EN.WIKT have many errors and are clearly insufficient and unmaintained: most contributors to EN.WIKT know only English, or their own language, but they won’t be able to navigate correctly to find these pages.
        I think you’ll get much better measures by ignoring these lists of “translations” and using instead the translations listed for each term and correctly linked across wikis via Wikidata (with bots checking these translation sections and filling the gaps).


      • Hi Philippe,

        I agree with you on a lot of that. I also would prefer Wikidata. I would prefer that the fields of an infobox of a language article in Wikipedia be populated by data from Wikidata. I would prefer that in Wiktionary the etymology, related terms, type of word, etc. all be automatically added to Wikidata. One could more easily produce etymology charts, calculate lexical distance, create bubble charts, etc. Sadly, those wishes are maybe a couple of years away. (Unless you can write a bot that does that?)

        I am going to link the English Wikipedia and Wiktionary, because it is a de-facto lingua franca, it will be most likely the first or second to be corrected by a native speaker. Plus, about 65% of the people that read this blog are not from an English speaking country.


      • “Plus, about 65% of the people that read this blog are not from an English speaking country,” you say, but still you use only pages in the English Wiktionary, which these people will never read or contribute to, even if there are errors or serious gaps in what is said there about their own language.


    • Hi rafszul,

      The graphic was originally done as a .wmf file, but it can be converted to any vector graphic format, e.g. to .svg. Then one could add hyperlinks to the individual labels or bubbles, pointing to, let’s say, their corresponding wiki article or Glottolog info. One could also embed that in java and have labels and extra information pop up next to a bubble when you hover over it with your mouse pointer.

      One could calculate an exact 2D or 3D position of each language node see: https://marcinciura.wordpress.com/2016/06/22/warping-maps-with-svd/

      Or try out a https://gephi.org/ diagram.

      One could add time, and have the bubbles vary in size and move around according to how their assumed spelling changed, similar to a gapminder chart https://bost.ocks.org/mike/nations/. For example, compare Gothic to Latin, then the medieval Germanic languages to the Romance languages, and finally the modern languages to each other, interpolate between the data points and populate the population lists.

      Or one could assign population center points with coordinates on a map and then coordinates for their lexical relationships, then press a button and jump between those two, see eg. http://www.nytimes.com/interactive/2012/02/13/us/politics/2013-budget-proposal-graphic.html?_r=0 by shancarter or http://vallandingham.me/bubble_charts_in_d3.html by Jim Vallandingham.

      BUT, first, most of this I cannot include in a basic WordPress blog; second, I am working on putting data together for a more complete and documented update, and that takes up most of my linguistic time; and third, I have no experience in programming any of these interactive visualization possibilities (apart from adding hyperlinks).


    • Also using Wikidata links, you’ll get support for many more languages. It is also much easier to process by bots to get more relevant statistics on many more terms than the very anglo-centered and too limited Swadesh list.


  6. Really nice piece of work – now getting some attention on Reddit.

    However “Since Scots has no official status or clear boundary I did not include it. ”
    No official status? According to Wikipedia:
    Classified as a “traditional language” by the Scottish Government.
    Classified as a “regional or minority language” under the European Charter for Regional or Minority Languages, ratified by the United Kingdom in 2001.
    Classified as a “traditional language” by The North/South Language Body.

    No clear boundary? Not sure what you mean by that. It has a *very* clear linguistic and physical boundary from English….
    Anyway, it doesn’t really matter, since just putting it on would be irrelevant – what would be nice would be to have had the actual lexical connection research to see how Scots is related. I think it might be surprising. Much more Germanic than English, but also with Scandi language and Gaelic connections…?


  7. Awesome map. I am very thankful. I can’t spot ‘lat’ for latin though, maybe hidden under the Italian circle?


  8. Your map is awesome and I’m thankful for finding it. I can’t spot ‘lat’ though, maybe is hidden behind the Italian circle? 🙂


  9. A cool body of work. I think that the larger Sor bubble is Upper Sorbian (Hsb, closer to Czech) while the smaller one is Lower Sorbian (Dsb, closer to Polish). The separate existence of Silesian is both dubious and recent. Also, how about renaming the conspicuous Mis1 to Crn?


    • Do you think the Sor bubbles are Upper and Lower on Tyshchenko’s originals? See Post: “how-much-does-language-change-when-it-travels”

      As to Montenegrin, I went with what SIL defines in ISO 639-3, when I do an update of this and Montenegrin has an official abbreviation I will include it.


      • Yes. Although I cannot read the labels in the bubbles on the 1997 black-and-white diagram, look at the legend in the upper left corner. You’ll see “вл верхньолужицька” (Upper Lusatian) and “нл нижньолужицька” (Lower Lusatian). The “вл” circle must be the larger and closer to Czech one.


      • While you are at it, there’s also Kashubian (csb) with 100,000 speakers, somewhere near Polish, and perhaps Rusyn (rue) with 70,000 speakers, shifted a bit towards Slovak and Polish from Ukrainian, yet further from Belarusian than your current diagram would imply.


      • Well the next version I would like to do would include everything from Semitic across to Turkic and further South to Iranic based off of Tyshchenko’s larger version. But this is lexical distance, so to include Kashubian and Rusyn (Silesian, Luxembourgish) I should compile a shorter Swadesh–Yakhontov list and compare them to their neighbors.


  10. Hello, I’m trying to use these results for a sociological problem in some European cities.
    Is there some way to obtain the numerical data that has driven these graphics?
    Thank you very much


  11. I am looking for the lexical distance between German and Yiddish. Would anyone know of work done on that pair? I don’t really know how to read these charts for comparison, since the two language pairs that I know that are similar, aren’t listed here, namely German and Yiddish, and Hebrew and Aramaic. Both pairs are lexically close, but I just don’t know how close.


    • Hebrew and Aramaic would be interesting, as would other Semitic languages, but I have no lexical distances for them. Yiddish is sadly quite small at present. I included Luxembourgish even though I did not have a distance and estimated where it would be; my estimate is that Yiddish would be to the right of Luxembourgish, a little closer to German at a lexical distance of about 10, and on a line between German and the Slavic branch. As to how to read these diagrams, I just put up a new blog post with my interpretation.


    • Thanks for a nice presentation!
      Concerning Yiddish it seems I spotted it on the original langmap next to HIM (NEMetskij, German), abbreviated as ÏДД (Jiddiš, I’d say).


      • Jiddiš with two “D”s, with connections to German, Danish and Dutch (no lexical distance numbers). I guess you are correct. It bothers me that the bubble is so big, but the Frisian bubble is also that big. Since there are no numbers, I will estimate its position and include it in the next update.
