Investigating the impact of sample size on cognate detection

Johann-Mattis List (Philipps-University Marburg, mattis.list@uni-marburg.de)

Journal of Language Relationship, № 11, 2014 - p.91-101

Abstract: The paper deals with the question of how many words are needed to successfully apply different methods for cognate detection. In order to investigate this question, a large gold standard consisting of 550 concepts translated into 4 languages (English, German, Dutch, and French) was compiled and divided into subsets of increasing sample size. Applying automatic methods for cognate detection to this gold standard shows that the accuracy of language-specific cognate detection methods clearly depends on the sample size. However, given that sample size depends on various factors, such as the genetic closeness of the languages or the degree of contact between the languages under investigation, no general lower or upper bound can be determined from the analysis.

Keywords: comparative method, lexicostatistics, etymology, computational linguistics

***

Supplementary materials

The ZIP archive includes:
- readme.md, a short description of the data format;
- ids.qlc, the gold standard in QLC format.