An analysis of the topology and estimation of accuracy for lexicostatistical classifications (on the data of Slavic languages)
Mikhail Vasilyev (Institute of Linguistics of the Russian Academy of Sciences, Moscow); Mikhail Saenko (Institute of Slavic Studies of the Russian Academy of Sciences, Moscow)
Journal of Language Relationship, № 18/3-4, 2020, pp. 320-347
Today, lexicostatistical methods are widely used in comparative-historical linguistics to establish linguistic kinship and build genealogical classifications. In works by Russian comparative linguists, the most common technique is the construction of phylogenetic trees with the aid of the Starling software, developed by Sergei Starostin at the end of the 20th century. Starostin’s algorithm was based on a modified ‘neighbor joining’ method and yielded satisfactory or plausible results in the vast majority of cases. At the same time, many researchers have pointed out significant shortcomings in the resulting classifications. The most serious are the instability of the tree under even minimal changes in the number of idioms, and the detection of a large number of fictitious taxa and nodes that are poorly explained or even contradict existing concepts. This article examines these shortcomings in detail, using a new lexicostatistical classification of 25 Slavic lects as an example. Based on this analysis, we propose a special procedure that makes it possible to minimize the negative effect of the identified deficiencies on the structure of the tree; it relies on statistical analysis of the resulting topology and is capable of identifying unreliable nodes within it. The technique is simple enough to be implemented in practice as an additional Starling component or a separate application.
Keywords: lexicostatistics, neighbor-joining method, genealogical classification, mean absolute deviation
Supplementary materials: Lexicostatistical matrices and phylogenetic trees