🤖 AI Summary
In historical linguistics, the genetic classification of isolates and unclassified languages has long relied on labor-intensive manual comparison. This paper proposes multimodal deep neural network models that integrate lexical and grammatical features, trained on data from more than 1,000 languages worldwide with known affiliations, to enable automatic cross-family classification. Methodologically, the models jointly encode lexical items (e.g., cognate sets) and structural properties (e.g., word order, alignment) within a unified architecture. Key contributions include: (1) a systematic integration of lexical and grammatical representations, empirically confirming the dominance of lexical features in deep genetic inference; (2) interpretable attribution analyses that uncover plausible genealogical links for isolates; and (3) substantial gains over unimodal baselines, particularly in detecting distant relationships between subgroups and in assigning preliminary classifications to so-far unaffiliated languages. The framework establishes a scalable, interpretable computational paradigm for deep phylogenetic reconstruction.
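The joint encoding of lexical and grammatical signals can be illustrated with a minimal late-fusion sketch. All feature values, weights, and family centroids below are invented toy data, and nearest-centroid assignment stands in for the paper's actual neural architecture; only the fusion idea (weighted concatenation of two feature blocks per language) is taken from the description above.

```python
# Toy late-fusion sketch: classify a language into a family from two
# feature "modalities". Data and weights are invented for illustration.

def fuse(lexical, grammatical, w_lex=0.7, w_gram=0.3):
    """Concatenate weighted lexical and grammatical feature vectors."""
    return [w_lex * x for x in lexical] + [w_gram * x for x in grammatical]

def nearest_family(vec, centroids):
    """Assign the family whose fused centroid is closest (Euclidean)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(centroids, key=lambda fam: dist(vec, centroids[fam]))

# Per-family centroids of fused features (hypothetical numbers).
centroids = {
    "Indo-European": fuse([1.0, 0.0, 0.2], [1.0, 0.0]),
    "Sino-Tibetan":  fuse([0.0, 1.0, 0.8], [0.0, 1.0]),
}

# An "unclassified" language with mixed but mostly Indo-European signals.
query = fuse([0.9, 0.1, 0.3], [0.8, 0.2])
print(nearest_family(query, centroids))  # → Indo-European
```

In a real model the two blocks would be learned embeddings rather than hand-set vectors, but the same design choice applies: both modalities enter a single shared representation before classification.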
📝 Abstract
In historical linguistics, the affiliation of languages to a common language family is traditionally established through a complex workflow that relies on manually comparing individual languages. Large-scale standardized collections of multilingual wordlists and grammatical language structures could improve this process and open new avenues for developing automated language affiliation workflows. Here, we present neural network models that use lexical and grammatical data from a worldwide sample of more than 1,000 languages with known affiliations to classify individual languages into families. In line with the traditional assumption of most linguists, our results show that models trained on lexical data alone outperform models based solely on grammatical data, whereas combining both types of data yields even better performance. In additional experiments, we show how our models can identify long-range relations between entire subgroups, how they can be employed to investigate potential relatives of linguistic isolates, and how they can provide initial hints about the affiliation of so-far unaffiliated languages. We conclude that models for automated language affiliation trained on lexical and grammatical data provide comparative linguists with a valuable tool for evaluating hypotheses about deep and unknown language relations.
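The modality comparison reported above (lexical-only vs. grammatical-only vs. combined) can be sketched as a small ablation loop. Everything here is a toy stand-in: the feature vectors are invented, nearest-centroid classification replaces the paper's neural models, and the grammatical block is deliberately constructed to be uninformative so that the lexical features dominate, loosely mirroring the reported finding.

```python
# Toy modality-ablation sketch: compare classification accuracy using
# lexical features alone, grammatical features alone, or both combined.
# All data are invented for illustration.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def classify(vec, centroids):
    return min(centroids, key=lambda fam: dist(vec, centroids[fam]))

def accuracy(langs, centroids, keys):
    """Accuracy when only the feature blocks named in `keys` are used."""
    def view(feats):
        return [x for k in keys for x in feats[k]]
    views = {fam: view(c) for fam, c in centroids.items()}
    return sum(classify(view(f), views) == fam for fam, f in langs) / len(langs)

# Hypothetical family centroids; grammar alone cannot separate A from B.
centroids = {
    "A": {"lex": [1.0, 0.0], "gram": [0.5]},
    "B": {"lex": [0.0, 1.0], "gram": [0.5]},
}
langs = [
    ("A", {"lex": [0.9, 0.2], "gram": [0.6]}),
    ("B", {"lex": [0.1, 0.8], "gram": [0.4]}),
]

for keys in (["lex"], ["gram"], ["lex", "gram"]):
    print(keys, accuracy(langs, centroids, keys))
```

On this toy data, lexical features alone already classify perfectly while grammatical features alone do no better than chance; the combined setting simply inherits the lexical signal, whereas the paper reports a genuine additional gain from combining both.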