I found the lexical distance map fascinating, but the closer I studied it, the more things bothered me. Trawling through the net, I found Tyshchenko's Cyrillic versions. So I sat down this weekend and translated, adjusted, combined and updated them to create my own version here:
Lexical Distance Among Languages of Europe
Here is a list of my changes.
First, the abbreviations. The Romance language abbreviated Pro, I found out after some research, stands for Provençal, and Ga I assume stands for Scottish Gaelic. Tyshchenko’s abbreviations are in Cyrillic; Elms transliterated them into Latin script. Some of the Latin-script abbreviations correspond to ISO 639-3, some do not. I changed them all to the ISO 639-3 standard.
Second, the legend shows bubbles that represent the number of speakers of each language. The bubble area correctly corresponds to the number of speakers in logarithmic classes: >3 000 speakers, >30 000 speakers, >300 000 speakers, etc. That means that the Ukrainian bubble, with 37 million speakers, is the same size as the German or Russian bubble, with 95 million speakers (in Europe), and the Icelandic bubble, with 300 000 speakers, is the same size as the bubbles of languages with 2.5 to 3 million speakers. I recalculated the diameter of each bubble and scaled it to the number of speakers of that language in Europe (source).
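To make the effect of the logarithmic classes concrete, here is a minimal Python sketch. The function name `speaker_class` and the rule that a value sitting exactly on a boundary falls into the higher class are assumptions for illustration; only the class boundaries (3 000, 30 000, 300 000, ...) and the speaker numbers come from the text above.

```python
def speaker_class(speakers, base=3_000):
    """Return the lower bound of the logarithmic class a language falls into.

    Classes start at `base` and grow by a factor of ten (3 000, 30 000, ...).
    Assumption: a value exactly on a boundary belongs to the higher class.
    """
    if speakers < base:
        return None  # below the smallest class
    lower = base
    # Integer arithmetic avoids floating-point surprises at the boundaries.
    while speakers >= lower * 10:
        lower *= 10
    return lower


for name, speakers in [
    ("Ukrainian", 37_000_000),
    ("German", 95_000_000),
    ("Icelandic", 300_000),
    ("a language with 2.5 million speakers", 2_500_000),
]:
    print(name, speaker_class(speakers))
```

Ukrainian and German land in the same class, as do Icelandic and a language with 2.5 million speakers, which is why their bubbles came out the same size.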
Third, the Elms diagrams depict several Indo-European branches, namely Albanian, Baltic, Celtic, Germanic, Hellenic, Romance (Italic) and Slavic, as well as some Uralic languages. Missing in Europe are the Indo-European Armenian branch (where does Europe end?), all Turkic languages, all Ibero-Caucasian and Kartvelian languages, the sole European Semitic language, Maltese, and finally Basque. Taking the 66 150 Faroese speakers as the cut-off line, there are 1 Armenian, 1 Basque, 2 Germanic, 6 Italic, 2 Kartvelian, 5 Caucasian, 1 Semitic, 8 Turkic, 2 Slavic and 14 Uralic languages missing (again depending on where Europe ends and what is considered a language). Adding the Kartvelian, Turkic, Indo-Iranian, Ibero-Caucasian and Uralic languages seemed too daunting a task. I did add many of the other missing languages.
Additions: Basque, Semitic, Indo-Iranian and Armenian
Adding Basque and Maltese was not that much of a hassle. Basque is at a lexical distance of 70 to the left of Spanish and 95 from Berber (which is in North Africa and is not included). Maltese is 70 down from Italian and an undefined distance from Greek. The Indo-Iranian Romani language is widespread and would have enough speakers to be included, but it diverges within itself and is so widespread that it is hard to determine its lexical distance. To get a lexical connection to Armenian I would have to add almost all the other missing languages; my apologies for not attempting it.
Germanic Adjustments
Since Scots has no official status or clear boundary, I did not include it. Luxembourgish, on the other hand, does and would be close to German and Dutch with a link to French; I placed it where I assume it would be, but with no distances marked. Frisian is presently defined as a language group: Northern Frisian, Eastern Frisian and Western Frisian. Sadly, use of the northern and eastern languages has diminished, and they have 10 000 and 2 250 speakers respectively. West Frisian, with roughly 467 000 speakers, is in good health. I could not figure out which Frisian Tyshchenko was referring to, but I assume West Frisian and labelled the bubble accordingly.
Norwegian has two official written forms, Bokmål and Nynorsk. I considered combining the two but decided against it. Tyshchenko assessed both separately to determine their divergence. That makes sense: he may not speak Norwegian or many of the other languages researched, so he had to rely on comparing syntax, vocabulary, morphology, etc. to determine lexical distance, and he did not have the resources to survey the Norwegian language to determine a standard Norwegian (there is none). So falling back on those two written forms is the best he could do. It also beautifully displays the relationship between Nynorsk, Bokmål and other languages: Bokmål is closer to Danish than to Nynorsk, and Nynorsk is closer to Icelandic than any other mainland language is.
Romance Adjustments
Back to Provençal. Elms translated провансальська as Provençal, which is probably the correct translation of the Ukrainian word. Provençal is considered either an Occitan dialect or its own language, depending on who you ask. I am going to assume that Tyshchenko was assessing the lexical distance of the Occitan language, and I re-labelled Pro accordingly. Or did Tyshchenko mean Franco-Provençal? Probably not: the line to Spanish is stronger than the line to French. I assume Franco-Provençal is missing and that a bubble labelled Frp should be placed close to Oci with links to French and Italian. Walloon (Wln) has archaisms coming from Latin and significant borrowings from the Germanic languages Dutch, Luxembourgish and German. Picard has no official status in France but does in Belgium, and it straddles the border between the two nations around Nord-Pas-de-Calais and Picardy. Asturian is recognized as typologically and phylogenetically close to Galician-Portuguese and Castilian, and less so to Navarro-Aragonese (Castilian and Aragonese do not make the cut-off line and are not included). The greatest number of Aromanian speakers is found in Greek Macedonia, with substantial numbers also found in Albania, Bulgaria, Serbia and FYR Macedonia, which also officially recognizes it. Compared to Romanian, the Eastern Romance language Aromanian (Rup) has been influenced more by Greek than by Slavic. I placed Rup close to Ron in the direction of Grk.
Slavic adjustments and updates
It has been a while since the Croats and Serbs decided that they do not speak the same language, and this is accurately depicted above, but the Bosnians and Montenegrins have also decided that they have their own languages. Thus I added a Bos bubble and a Mis1 bubble (for the missing ISO code for Montenegrin) right next to the Hrv and Srp bubbles. In Elms’s translation there is a bubble named Sr between the Czech and Polish bubbles; in Tyshchenko’s 1999 diagram there are two bubbles there. I assume the larger one is Silesian and the smaller one Sorbian. I added both, even though Sorbian does not make the cut-off line.
That leaves me with 54 languages, representing 670 million people; Europe has an estimated population of 740 million. It checks out.
ISO 639-3 Abbreviation | Language | Branch or Family | Speakers in Europe | Bubble Diameter |
deu | German | Germanic | 95 000 000 | 4.75 |
rus | Russian | Slavic | 95 000 000 | 4.75 |
fra | French | Italic-Romance | 60 000 000 | 3.00 |
ita | Italian | Italic-Romance | 57 700 000 | 2.89 |
eng | English | Germanic | 55 600 000 | 2.78 |
spa | Spanish | Italic-Romance | 45 000 000 | 2.25 |
pol | Polish | Slavic | 38 663 000 | 1.93 |
ukr | Ukrainian | Slavic | 37 000 000 | 1.85 |
ron | Romanian | Italic-Romance | 23 782 000 | 1.19 |
nld | Dutch | Germanic | 21 944 000 | 1.10 |
grk | Greek | Hellenic | 13 420 000 | 0.67 |
hun | Hungarian | Uralic | 12 606 000 | 0.63 |
ces | Czech | Slavic | 10 619 000 | 0.53 |
cat | Catalan | Italic-Romance | 10 000 000 | 0.50 |
por | Portuguese | Italic-Romance | 10 000 000 | 0.50 |
swe | Swedish | Germanic | 9 197 090 | 0.46 |
srp | Serbian | Slavic | 8 957 906 | 0.45 |
bul | Bulgarian | Slavic | 8 157 770 | 0.41 |
sqi | Albanian | Albanian | 7 400 000 | 0.37 |
hrv | Croatian | Slavic | 5 752 090 | 0.29 |
dan | Danish | Germanic | 5 522 490 | 0.28 |
fin | Finnish | Uralic | 5 392 180 | 0.27 |
slk | Slovak | Slavic | 5 187 740 | 0.26 |
nob | Norwegian Bokmål | Germanic | 3 854 000 | 0.19 |
bel | Belarusian | Slavic | 3 312 610 | 0.17 |
lit | Lithuanian | Baltic | 3 001 860 | 0.15 |
glg | Galician | Italic-Romance | 2 355 000 | 0.12 |
bos | Bosnian | Slavic | 2 225 290 | 0.11 |
slv | Slovene | Slavic | 2 085 000 | 0.10 |
lav | Latvian | Baltic | 1 752 260 | 0.09 |
mkd | Macedonian | Slavic | 1 407 810 | 0.07 |
srd | Sardinian | Italic-Romance | 1 200 000 | 0.06 |
est | Estonian | Uralic | 1 165 400 | 0.06 |
nno | Norwegian Nynorsk | Germanic | 846 000 | 0.04 |
wln | Walloon | Italic-Romance | 600 000 | 0.03 |
eus | Basque | Basque | 545 872 | 0.03 |
cym | Welsh | Celtic | 536 890 | 0.03 |
mlt | Maltese | Semitic | 522 000 | 0.03 |
szl | Silesian | Slavic | 510 000 | 0.03 |
mis1 | Montenegrin | Slavic | 510 000 | 0.03 |
fry | Western Frisian | Germanic | 467 000 | 0.02 |
ltz | Luxembourgish | Germanic | 336 710 | 0.02 |
isl | Icelandic | Germanic | 300 000 | 0.02 |
gle | Irish | Celtic | 276 310 | 0.01 |
oci | Occitan | Italic-Romance | 220 000 | 0.01 |
bre | Breton | Celtic | 206 000 | 0.01 |
pcd | Picard | Italic-Romance | 200 000 | 0.01 |
frp | Franco-Provençal | Italic-Romance | 140 000 | 0.01 |
rup | Aromanian | Italic-Romance | 114 340 | 0.01 |
ast | Asturian | Italic-Romance | 110 000 | 0.01 |
gla | Scottish Gaelic | Celtic | 68 130 | 0.00 |
fao | Faroese | Germanic | 66 150 | 0.00 |
lat | Latin | Italic-Romance | 30 000 | 0.00 |
wen | Sorbian | Slavic | 30 000 | 0.00 |
Note: This is just Europe, so if you add Spanish, French, Portuguese and English from elsewhere this table would look different.
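The diameters in the table are consistent with dividing the speaker count by 20 million (95 000 000 / 4.75 = 20 000 000). The sketch below reproduces that and, purely for comparison, shows an area-proportional alternative. The constant `SPEAKERS_PER_UNIT` is inferred from the table rather than stated anywhere, the function names are made up for illustration, and the second function is not the method used for the table.

```python
import math

# Inferred from the table (95 000 000 speakers -> diameter 4.75); an assumption.
SPEAKERS_PER_UNIT = 20_000_000


def diameter_linear(speakers):
    """Diameter proportional to speakers, which matches the table values."""
    return speakers / SPEAKERS_PER_UNIT


def diameter_area_proportional(speakers, ref_speakers=95_000_000, ref_diameter=4.75):
    """Alternative for comparison only: bubble AREA proportional to speakers.

    Area grows with the square of the diameter, so the diameter has to grow
    with the square root of the speaker count.
    """
    return ref_diameter * math.sqrt(speakers / ref_speakers)


for code, speakers in [("deu", 95_000_000), ("fra", 60_000_000), ("isl", 300_000)]:
    print(code,
          f"{diameter_linear(speakers):.2f}",
          f"{diameter_area_proportional(speakers):.2f}")
```

With a linear diameter, a language with twice the speakers gets a bubble with four times the area; the square-root variant keeps small languages visible at the cost of deviating from the table.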
Fourth. I added a list of abbreviations and redid the distance scale and speaker categories.
Fifth. Tyshchenko gave the language branches circular labels and, in the version that includes Iranic, also drew circles around the branches. In another version the spaces between the connection lines within the branches are coloured in. This all reminded me of an Euler diagram that also shows the relationship between the branches, particularly where the Celtic, Germanic and Romance circles overlap. I wanted to include this in my version, so I gave each branch and each language family its own bubble. For some I tinkered around by fading the edges to symbolise that the boundaries of the languages are fusing with other branches.
Sixth. I added two gravestones for Anatolian and Tocharian.
Seventh. I added arrows to other languages outside of Europe.
Finally, a note on the lines that link the different language bubbles. If you look at the Germanic branch, you will notice that there are links between English and every other Germanic language except Swedish. The same can be observed for the larger languages in Romance or Slavic. A missing line between two languages does not mean that there is no link between them; it just means that the lexical distance between these two languages has not been researched yet. Thus, for example, the link between Albanian and Serbian or between German and French is real but not shown.
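One way to keep "not researched" separate from "not related" when working with the data is to store only the measured pairs and return nothing for the rest. This is a minimal sketch: the dictionary layout and the function name `lexical_distance` are assumptions for illustration, and the two values are the ones quoted above for Basque and Maltese.

```python
# Only researched pairs are stored. frozenset makes the pair order irrelevant.
DISTANCES = {
    frozenset(("eus", "spa")): 70,  # Basque - Spanish, from the text
    frozenset(("mlt", "ita")): 70,  # Maltese - Italian, from the text
}


def lexical_distance(a, b):
    """Return the researched distance, or None if the pair was never measured."""
    return DISTANCES.get(frozenset((a, b)))


print(lexical_distance("spa", "eus"))  # 70
print(lexical_distance("deu", "fra"))  # None: unresearched, not unrelated
```

Returning `None` instead of a large number avoids treating a missing line as a large distance when positioning nodes.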
Update 17.05.2015
An earlier version of this page had Romansh (Roh) and Latvian mislabelled, and was missing Friulian, with 300 000 speakers and the ISO 639-3 code Fur.
[…] Among subjects that interest me, Stephan covers data visualization. At one time, a picture from his post on lexical distance among languages of Europe went viral. Since then, he has been pondering ways to improve the illustration by including more […]
[…] is a fair question. If you study the Lexical Distance Among Languages of Europe 2015 graphic, you see that what is plotted does not always match up with what is labelled. I previously […]
Also Breton [BRE] should be very close to Welsh [CYM]. As late as the beginning of the 20th century, both languages were largely mutually intelligible (notably between fishermen). Breton should then be more to the north. And I don’t see why the link between GLA and GLE (Scottish Gaelic and Irish) is curved when the link between them is evident and should be straight, creating two subgroups in the Celtic languages:
GLE+GLA, and CYM+BRE
The map seems unclear.
Romance languages aren’t well represented. Portuguese and Galician should be connected because they are two languages that originated from the same language (Galician-Portuguese or Old Portuguese), and the map seems to suggest that Portuguese and Galician came from Spanish, which makes no sense. The case of Occitan and Catalan could also have been studied better.
In the case of the Germanic languages, Frisian has a strong connection with the Nordic languages.
I agree that Frisian (FRY) could be a bit higher in the graph, located just above (instead of below) the link between German and English.
But the given distance between English and German (49) seems a bit too low, and in fact (modern) English is nearer to French than to German (this was not true for old medieval English/Aenglic/Anglo-Saxon before the Norman invasion), with over 60% of English terms coming from Italic languages (Norman, then French, but also Church Latin before the Anglican split) that replaced many terms from Old English.
English has kept many terms from “Law French”, including in its current legislation. There’s just a small (but heavily used) Germanic base substrate in English today, and many things have disappeared.
In fact the distance between English and (High) German is too high for another reason: Old English derived from Low German (whose descendants are Dutch, Frisian, Lower Saxon, and Plautdietsch) and never from High German.
So German should be farther to the right/east (and a bit higher to the top/north), and English nearer to French and Frisian, even if they all remain in the “Germanic” group. And Frisian should effectively be connected with Danish or Swedish (both languages being very near to each other and having a common history with the same ruling kingdoms, or with the Hanseatic states).
Dutch, Frisian, Low Saxon, and English are their own group, which we can call “Anglo-Saxon”, clearly distinguished from other High Germanic languages, and from other “Nordic” languages (Swedish, Danish, Norwegian, Faeroese, Icelandic).
The link between Nordic languages and Finnic languages (Finnish, Estonian, Latgalian, Karelian…) is not evident at all. In fact Finnish is nearer to other Uralic languages, notably Hungarian. But Finland has been under two ruling influences: the kingdom of Sweden and the former Russian Empire, so Finnish has captured some terms from both Nordic-Germanic (Swedish) and Slavic (Russian, and from the formerly larger Polish kingdom for a short period before it was dismantled by Russia and Germany), while keeping its exclusive Uralic form (notably complex agglutination and the most common lexical roots, some of them shared with Hungarian and other minority Uralic languages still spoken in Russia, plus some terms from Eskimo-Aleut, as also in Swedish, Russian, Danish, Greenlandic, and minority languages in northern Canada, Alaska, the Bering islands, and Far East Russia).
[…] Lexical Distance Among Languages of Europe 2015, Stephen Steinbach, 17 May 2015. […]
[…] I have received many requests to update this graphic, which tries to show the relationship between languages, to include more languages, mostly along the lines “my language is missing, can you include […]
Great list, but it is a big omission not to include distances from Latin.
About 20 to Italian, 23 to Portuguese, 27 to Provencal, 33 to French, 36 to Romanian, 28 to 33 to Sardinian, 57 to Greek, 53 to German, 54 to English, 59 to Croatian.
Hello – this is a super piece of work, thanks so much! It’d be amazing to see how Georgian, Turkish, Azeri and Armenian link in – I know you’ve said Armenian would take a lot of work so I assume the others would be even more (?) but if you got a chance any time it would be interesting.
Hi! I could really use this information to help my uni project!! Would you be able to send the numerical data which you made this graphic from? Perhaps as a spreadsheet if you were to have it?
Sure, download it here.
https://alternativetransport.wordpress.com/lexical-distance-matrix/
[…] Unsupervised machine learning algorithms are used to address problems very different from the previous ones. In general, the questions they can answer are open questions where there is no “right” answer. There are several subcategories of unsupervised machine learning algorithms. The best known is probably the one made up of clustering algorithms. These algorithms aim to partition the data into groups containing “similar” individuals. To do so, they need a notion of distance between two rows of data. For example, it is fairly natural to group data by similarity criteria such as age or height. It is less obvious, however, to identify a distance between the taste of two whiskies (cf. the article by Lapointe and Legendre (1994)) or between two European languages (cf. Lexical distance among languages of Europe). […]
Hey, would you consider uploading your map to Wikimedia Commons? We would like to feature your research in one of our reports for Wikidata on Meta, and include the image. But in order to do that, the image has to be up on Commons. If that’s not an option, don’t worry. We will simply link to the article.
done
Updating and recalculating this would be much easier with the Wikidata/Wiktionary community.
I am amazed by your work and your passion for languages, truly inspiring, thank you!
Hi there, I have just discovered your work that I find very inspiring.
I am currently preparing a lecture (in French) on the visualization of networks.
It would be really interesting to use this matrix of relationships as an example.
I can deduce the matrix from your nice visualization, but it would be of great help if you agreed to send me the original dataset.
We can discuss by mail if you want more information on the content of the lecture.
The language spoken in Luxemburg is Luxemburgish but – more correctly – Letzeburgesch
[…] which list? Lexical Distance Among Languages of Europe used ISO 639-3 . It is maintained by SIL which also publicizes Ethnologue (ɠ). ISO 639-3 has 7 865 […]
Catalan and occità are way closer; they are actually the closest to each other.
Also, Italy holds MANY languages; Furlan and Sardinian are just 2, and there would be many, many more languages.
Hello 🙂
Can you tell me why there is a significant difference between the distances and positions among Slavic languages in these 2 pictures:
Picture 1
Picture 2
Look e.g. at the positions of Slovak and Bulgarian and their distances to other Slavic languages.
AlternativeTransport Edit: Linked Pictures and combined text from other question.
37 million Ukrainians? That’s too many. As you can see, that is the whole population of Ukraine, but only half of them can speak Ukrainian. And that is a very optimistic assessment.
Excellent job. I would like to know what the numbers in the interrupted lines mean. Thank you.
In the Celtic group, Cornish (in South-West England) and Breton (in North-West France) are very closely related (it is frequently reported that sailors from both regions easily spoke to each other with little adaptation and were not even required to adapt their language to be fully understood; in fact there are as few differences between Cornish and Breton as between the 4 major variants of Breton itself, which also has several competing orthographic standards depending on these variants and the need or absence of need to distinguish local accents).
You could also add Manx to this group (also strongly linked to both Cornish and Irish Gaelic, and somewhat to Scottish Gaelic).
In the French group, Norman is considered a dialect of French (when historically it was not: French was born much later). The version of Norman that came to England after the Norman conquest explains the links that remain now between English and French, but Norman also includes Jersiais and Guernésiais, which are usually not considered part of French, even if they are strongly related to continental Norman in France. But Jersiais at least now has an official status in Jersey, and a locally approved dictionary that ignores other forms used by Norman on the continent. In Wikipedia (despite the fact that it has been using, and is still using, a non-standard language code “nrm” that is in fact assigned to an unrelated language, even though Norman now has a standard code “nrf”, assigned at least for the continental form), Norman, Jersiais and Guernésiais are not distinguished (Jersiais is still pending a request for a separate ISO 639-3 code instead of “nrf”; Guernésiais could follow as well).
Portuguese should be linked to Galician, since it derives from Galician. In the Middle Ages Galicia split in two, giving rise to Portugal. Both spoke the same language, since until then they were all Galicians. Over the centuries the two languages evolved, Galician under the influence of Castilian, but even today Galician and Portuguese are practically the same, like the Spanish of Spain and the Spanish of Argentina.
I dearly want this in large size printed for my linguist wife. Any chance one of the vector formats could be made available? Or that you could start selling prints?
Hi Larsrc,
If you want to gift it to a linguist then be careful, there are inaccuracies and unclear methodology which seem to pain some linguists. I have sent a version with a higher resolution to someone for a presentation, however not in a vector format.
I did a quick check of what this would cost as a poster. A 48″ × 36″ poster on photo paper with shipping via USPS would cost about 154 USD. A 120 cm × 90 cm poster on photo paper with shipping within the EU would cost about 96 EUR. For canvas, Sardisverlag sells maps printed on canvas for 60 EUR for an unmodified rolled A1 print (59 × 84 cm), up to 300 EUR for an individual version of an A0 (84 × 119 cm) print on a stretcher frame. I guess they got a better price by ordering in bulk the prints they are selling now. If you dearly want it I can look into printing and sending a copy.
I think you should consider placing Catalan and Occitan closer together, or at the bare minimum connect them with a line. There is a great degree of mutual intelligibility between the two, particularly with regard to the more southerly dialects of Occitan.
See this post on Quora, for example: https://www.quora.com/Are-Occitan-Provençal-and-Catalan-mutually-intelligible
On Reddit the question is whether the two are the same language! https://www.reddit.com/r/linguistics/comments/43njkg/are_catalan_and_occitan_the_same_language/
Historically (in the Middle Ages) Occitan and Catalan were the same language (so “Old Catalan” and “Old Occitan” refer to the same language).
But they evolved separately into their modern forms, and are now clearly separated. They have common roots, but distinct orthographies and pronunciations, and Catalan has borrowed more forms from Castilian Spanish (more or less adapted) than Occitan, which borrowed more forms from French (a lately created language, which started initially with many imports from Occitan, but ignored the Catalan influence, before becoming a national standard in France only at the beginning of the 20th century, when regional languages were still much more active than today: official French became really predominant in France only after World War I, which mixed a lot of populations from all regions on the battlefields or through forced migrations of the civil population across regions).
Only at this time did Occitan and Catalan usage start to decline, and this accelerated only after the 1960s, but Occitan and Catalan remain separate forms in the same family of “Oc languages”. There’s as much difference between Occitan and Catalan as between French and Castilian Spanish. Anyway, Occitan is not really a single unified language: it has significant dialectal differences between the variants spoken in Auvergne, Provence or Italy, and some variants in Roussillon that also mix in some forms from Catalan.
If you think that Occitan and Catalan are the same language, you are confusing “Occitan” (the modern regional language) with the “Oc languages” family, which includes all modern variants of Occitan (essentially in France, Italy and Switzerland), all modern variants of Catalan (in France, Spain and Andorra), and old Occitan-Catalan (whose separation started in the Middle Ages with the English occupation of Southwestern France, while Catalan forms were protected by former Catholic kingdoms in Spain before the “Reconquista” of Castile).
The historic invasions of France and Spain explain their difference: you cannot mix them and consider that modern Occitan and modern Catalan are the same language even if they belong to the same family and have a common root (Old Occitan = Old Catalan), which is also a clearly distinct language in that family.
I’m not claiming they are the same language, but rather I use the fact that a debate exists as evidence that the two are (still) very closely related – I would say considerably more so than the graphic indicates.
Linguists generally agree that Catalan is Occitan’s closest relative and vice versa. I speak Catalan as a second language, and certain varieties of Occitan are quite intelligible to me. And yet, you wouldn’t think so by looking at the graphic, where Occitan is displayed as closer to Portuguese than to Catalan. This just doesn’t seem right to me.
By the way, I believe that Catalan is not considered an oc language, as the word for “yes” is “sí”, not “oc”.
Yes, I agree Occitan and Catalan should be placed differently. The language nodes were placed according to the researched connections and not all connections were researched. Thus smaller less well researched languages have sometimes unfortunate node placements.
Wow! See what I’ve found via http://cnl.psych.cornell.edu/pubs/2016-bwhsc-PNAS.pdf — http://asjp.clld.org looks exactly like what you need.
The things the Max Planck Institute supports! Also, see these 9 other places http://linguistics.stackexchange.com/questions/11037/database-of-swadesh-lists/
In your upcoming updated version, if you compare word spelling, you will run into trouble with isophonic letter combinations like Slavic č and ч, Polish cz, Hungarian cs, Romanian ci, etc. If you compare the pronunciation, you will sometimes lose the etymology like with Polish ó and Czech ů pronounced as /u/ while derived from “o” or Russian unstressed vowels that, variously written, are only pronounced as /a i u/. That said, I think pronunciation is the way to go. I can help you compile a uniform pronunciation list for Slavic languages. Just tell me which version of Swadesh list you use.
I would be grateful and delighted for your help. I will clean up the list I have so far and share it.
What about the Neapolitan language?
Same remark about Corsican (Italic-Romance, nearer to Italian and Sardinian than to French, with many lexical items from the Genoese dialects in Italy, but now with increasing links to French as it is the legal language in Corsica; Corsican still has some regional recognition and is not uncommon to find in various places in Corsica).
Yes, it would be nice to further split the larger bubbles (German, Italian, Spanish). Calculating those would require word lists with their spelling, and you can help! Consider completing a word list here: https://en.wiktionary.org/wiki/Appendix:Swadesh_lists
I don’t like the idea of collecting these lists in separate unstructured pages of the English Wiktionary (behind obscure links that native speakers are unlikely to find or correct), when there’s the universal way to do that using pages for “Swadesh list” in each per-language Wiktionary.
Corsican has its own Wiktionary, and you should be able to use Wikidata entries (or the terms in the “translation” section of each defined term’s page on the native Wiktionary, then reverse the links to match them).
These Swadesh lists on EN.WIKT have many errors and are clearly insufficient and unmaintained: most contributors to EN.WIKT know only English, or only their own language, and they won’t be able to navigate correctly to find these pages.
I think you’ll get much better measures by ignoring these lists of “translations” and instead using the translations listed for each term and correctly linked across wikis via Wikidata (with bots checking these translation sections and filling the gaps).
Hi Philippe,
I agree with you on a lot of that. I also would prefer Wikidata. I would prefer that the fields of the infobox of a language article in Wikipedia be populated by data from Wikidata. I would prefer that in Wiktionary the etymology, related terms, type of word, etc. all be automatically added to Wikidata. One could then more easily produce etymology charts, calculate lexical distance, create bubble charts, etc. Sadly, those wishes are maybe a couple of years away. (Unless you can write a bot that does that?)
I am going to link the English Wikipedia and Wiktionary, because it is a de-facto lingua franca, it will be most likely the first or second to be corrected by a native speaker. Plus, about 65% of the people that read this blog are not from an English speaking country.
“Plus, about 65% of the people that read this blog are not from an English speaking country,” you say, but still you use only pages in the English Wiktionary that these people will never read or contribute to, even if there are errors or serious gaps in what is said there about their own language.
hi there, have you got some more interactive visualisations of this data?
Hi rafszul,
The graphic was originally done as a .wmf file, but it can be converted to any vector graphic format, e.g. .svg. Then one could add hyperlinks from the individual labels or bubbles to, let’s say, their corresponding wiki article or Glottolog info. One could also embed that in java and have labels and extra information pop up next to a bubble if you hover over it with your mouse pointer.
One could calculate an exact 2D or 3D position of each language node see: https://marcinciura.wordpress.com/2016/06/22/warping-maps-with-svd/
Or try out a https://gephi.org/ diagram.
One could add time, and have the bubbles vary in size and move around according to how their assumed spelling changed, similar to a gapminder chart https://bost.ocks.org/mike/nations/. For example, compare Gothic to Latin, then medieval Germanic languages to Romance languages, and finally modern languages to each other, interpolate between data points and populate population lists.
Or one could assign population center points with coordinates on a map and then coordinates for their lexical relationships, then press a button and jump between those two, see eg. http://www.nytimes.com/interactive/2012/02/13/us/politics/2013-budget-proposal-graphic.html?_r=0 by shancarter or http://vallandingham.me/bubble_charts_in_d3.html by Jim Vallandingham.
BUT, first, most of this I cannot include in a basic WordPress blog; second, I am working on putting data together for a more complete and documented update, and that takes up most of my linguistic time; and third, I have no experience in programming any of these interactive visualization possibilities (apart from adding hyperlinks).
Also using Wikidata links, you’ll get support for many more languages. It is also much easier to process by bots to get more relevant statistics on many more terms than the very anglo-centered and too limited Swadesh list.
[…] common misunderstanding of the Lexical Distance Diagram is that a line between two languages means they are related more than they are to other languages […]
[…] Grammars blog also dissected this infographic and raised some issues not discussed here. Another post on this blog merely compares the Levenshtein Distance of a small list of words in different […]
Nice map and article, thanks!
Really nice piece of work – now getting some attention on Reddit.
However, “Since Scots has no official status or clear boundary I did not include it.”
No official status? According to Wikipedia:
Classified as a “traditional language” by the Scottish Government.
Classified as a “regional or minority language” under the European Charter for Regional or Minority Languages, ratified by the United Kingdom in 2001.
Classified as a “traditional language” by The North/South Language Body.
No clear boundary? Not sure what you mean by that. It has a *very* clear linguistic and physical boundary from English….
Anyway, it doesn’t really matter, since just putting it on would be irrelevant – what would be nice would be to have had the actual lexical connection research to see how Scots relates. I think it might be surprising. Much more Germanic than English but also with Scandi language and Gaelic connections…?
Beautiful map and beautiful work. You have talent!!! I hope you make a full map of all language families some day.
60 million French speakers seems low; France alone has a population of 66 million people, and you have to add Belgium.
Your estimate is better. The data for this version comes from here:
https://en.wikipedia.org/w/index.php?title=Languages_of_Europe&oldid=648206584#Number_of_speakers
Before that section of the article was deleted, it stated 65 million speakers.
Awesome map. I am very thankful. I can’t spot ‘lat’ for Latin though, maybe it is hidden under the Italian circle?
Latin was mistakenly included in an earlier version but then left out of this version.
Ladin is missing. It’s spoken by 30,000 people in Italy.
Your map is awesome and I’m thankful for finding it. I can’t spot ‘lat’ though, maybe it is hidden behind the Italian circle? 🙂
A cool body of work. I think that the larger Sor bubble is Upper Sorbian (Hsb, closer to Czech) while the smaller one is Lower Sorbian (Dsb, closer to Polish). The separate existence of Silesian is both dubious and recent. Also, how about renaming the conspicuous Mis1 to Crn?
Do you think the Sor bubbles are Upper and Lower on Tyshchenko’s originals? See Post: “how-much-does-language-change-when-it-travels”
As to Montenegrin, I went with what SIL defines in ISO 639-3. When I do an update of this and Montenegrin has an official abbreviation, I will include it.
Yes. Although I cannot read the labels in the bubbles on the 1997 black-and-white diagram, look at the legend in the upper left corner. You’ll see “вл верхньолужицька” (Upper Lusatian) and “нл нижньолужицька” (Lower Lusatian). The “вл” circle must be the larger one, closer to Czech.
Alright, Silesian will be scratched and Sorbian split.
While you are at it, there’s also Kashubian (csb) with 100,000 speakers, somewhere near Polish, and perhaps Rusyn (rue) with 70,000 speakers, shifted a bit towards Slovak and Polish from Ukrainian, yet further from Belarusian than your current diagram would imply.
Well, the next version I would like to do would include everything from Semitic across to Turkic and further south to Iranic, based on Tyshchenko’s larger version. But this is lexical distance, so to include Kashubian and Rusyn (and Silesian, Luxembourgish) I would have to compile a shorter Swadesh–Yakhontov list and compare them to their neighbors.
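To give an idea of what comparing short word lists could look like in practice, here is a minimal Python sketch that averages a normalized Levenshtein (edit) distance over paired words. This is only one possible measure and an assumption on my part, not Tyshchenko’s method; it compares spellings rather than judging cognates. The function names and the three-word sample lists are illustrative only.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(
                min(
                    prev[j] + 1,  # delete ca
                    cur[j - 1] + 1,  # insert cb
                    prev[j - 1] + (ca != cb),  # substitute, free if equal
                )
            )
        prev = cur
    return prev[-1]


def word_list_distance(words_a, words_b):
    """Average edit distance per word pair, scaled to 0..100.

    Each pair is normalized by the longer word so long words do not dominate.
    The lists must be aligned by meaning and contain no empty strings.
    """
    total = 0.0
    for a, b in zip(words_a, words_b):
        total += levenshtein(a, b) / max(len(a), len(b))
    return 100 * total / len(words_a)


# Tiny illustrative lists (water, hand, night), far too short for real use.
german = ["wasser", "hand", "nacht"]
dutch = ["water", "hand", "nacht"]
english = ["water", "hand", "night"]

german_dutch = word_list_distance(german, dutch)
german_english = word_list_distance(german, english)
```

A real comparison would need a full list, a consistent transcription instead of everyday spelling, and probably a decision about loanwords.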
Hello, I’m trying to use these results for a sociological problem in some European cities.
Is there some way to obtain the numerical data that has driven these graphics?
Thank you very much
I don’t have access to the original research, but I can send you a matrix of the relationships in the main graphic here.
That would be great!
Can you email me at
[removed]@[removed]
Thank you very much!
Have done so; if you did not get it, leave a message here and I will try again.
[…] source for the lexical distances Lexical Distance Among Languages shown here are from Prof. Tyshchenko’s research. Sadly, I have not read any literature of his […]
I am looking for the lexical distance between German and Yiddish. Would anyone know of work done on that pair? I don’t really know how to read these charts for comparison, since the two language pairs that I know to be similar aren’t listed here, namely German and Yiddish, and Hebrew and Aramaic. Both pairs are lexically close, but I just don’t know how close.
Hebrew and Aramaic would be interesting, as would other Semitic languages, but I have no lexical distances for them. Yiddish is sadly quite small at present. I included Luxembourgish even though I did not have a distance and estimated where it would be; my estimate is that Yiddish would be to the right of Luxembourgish, a little closer to German at a lexical distance of about 10, and on a line between German and the Slavic branch. As to how to read these diagrams, I just put up a new blog post with my interpretation.
Thanks for a nice presentation!
Concerning Yiddish it seems I spotted it on the original langmap next to HIM (NEMetskij, German), abbreviated as ÏДД (Jiddiš, I’d say).
Jiddiš with two “D”s, with connections to German, Danish and Dutch (no lexical distance numbers). I guess you are correct. It bothers me that the bubble is so big, but the Frisian bubble is also that big. Since there are no numbers, I will estimate its position and include it in the next update.
This is beautiful work… I’m going to have to sit down and spend an hour with your map.
Thank you. I haven’t got round to adding Turkic, but I hope I will manage soon. On the previous post you can see a Tyshchenko original with Turkic languages.
[…] post described the troubles with mapping lexical distance in 2D and what issues I encountered when drawing this diagram. Playing around this weekend I came up with a proposal how to eliminate some of those […]
[…] This is a follow up to a previous post Lexical Distance Among Languages. […]
[…] Lexical Distance Among Languages of Europe 2015 | Alternative Transport […]