From Isolates to Families: Using Neural Networks for Automated Language Affiliation

📅 2025-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
In historical linguistics, the genetic classification of isolates and unclassified languages has long relied on labor-intensive manual comparison. This paper presents multimodal deep neural network models that integrate lexical and grammatical features, trained on data from more than 1,000 languages with known affiliations, to classify languages into families automatically. Methodologically, the models jointly encode lexical items (e.g., cognate sets) and structural properties (e.g., word order, alignment) within a unified architecture. Key contributions include: (1) a systematic integration of lexical and grammatical representations, empirically confirming the dominance of lexical features in genetic inference; (2) interpretable attribution analyses that surface plausible genealogical links for isolates; and (3) substantial gains over unimodal baselines, particularly in detecting long-range relations between subgroups and in assigning preliminary classifications to so far unaffiliated languages. The framework offers a scalable, interpretable computational approach to investigating deep language relations.
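The core idea of combining the two modalities can be illustrated with a minimal, self-contained sketch: each language is represented by a lexical vector (e.g., cognate-set membership) and a grammatical vector (typological features), the two are fused by concatenation, and a classifier maps the fused vector to a family label. All feature values, language names, and family labels below are invented toy data, and the single softmax layer is only a stand-in for the paper's deeper network.

```python
import math
import random

# Toy data (illustrative only): two modalities per language.
# "lex"  = binary cognate-set membership, "gram" = typological features.
LANGS = {
    "lang_a": {"lex": [1, 1, 0, 0], "gram": [1, 0], "family": 0},
    "lang_b": {"lex": [1, 0, 1, 0], "gram": [1, 0], "family": 0},
    "lang_c": {"lex": [0, 0, 1, 1], "gram": [0, 1], "family": 1},
    "lang_d": {"lex": [0, 1, 0, 1], "gram": [0, 1], "family": 1},
}

def fuse(entry):
    """Early fusion: concatenate lexical and grammatical vectors."""
    return entry["lex"] + entry["gram"]

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [v / s for v in exps]

# One softmax layer over the fused features (stand-in for a deeper net).
random.seed(0)
N_FEAT, N_FAM = 6, 2
W = [[random.uniform(-0.1, 0.1) for _ in range(N_FEAT)] for _ in range(N_FAM)]
b = [0.0] * N_FAM

def predict(x):
    """Return a probability distribution over families for one language."""
    scores = [sum(w * xi for w, xi in zip(row, x)) + bk
              for row, bk in zip(W, b)]
    return softmax(scores)

# Train with per-sample gradient descent on cross-entropy loss.
for _ in range(500):
    for entry in LANGS.values():
        x, y = fuse(entry), entry["family"]
        p = predict(x)
        for k in range(N_FAM):
            grad = p[k] - (1.0 if k == y else 0.0)
            for j in range(N_FEAT):
                W[k][j] -= 0.5 * grad * x[j]
            b[k] -= 0.5 * grad

for name, entry in LANGS.items():
    probs = predict(fuse(entry))
    print(name, "-> family", probs.index(max(probs)))
```

In this setup, dropping either modality before fusion gives a unimodal baseline, which is how the paper's lexical-only vs. grammatical-only vs. combined comparison can be framed.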

📝 Abstract
In historical linguistics, the affiliation of languages to a common language family is traditionally carried out using a complex workflow that relies on manually comparing individual languages. Large-scale standardized collections of multilingual wordlists and grammatical language structures might help to improve this and open new avenues for developing automated language affiliation workflows. Here, we present neural network models that use lexical and grammatical data from a worldwide sample of more than 1,000 languages with known affiliations to classify individual languages into families. In line with the traditional assumption of most linguists, our results show that models trained on lexical data alone outperform models solely based on grammatical data, whereas combining both types of data yields even better performance. In additional experiments, we show how our models can identify long-ranging relations between entire subgroups, how they can be employed to investigate potential relatives of linguistic isolates, and how they can help us to obtain first hints on the affiliation of so far unaffiliated languages. We conclude that models for automated language affiliation trained on lexical and grammatical data provide comparative linguists with a valuable tool for evaluating hypotheses about deep and unknown language relations.
Problem

Research questions and friction points this paper is trying to address.

Automate language family classification using neural networks.
Combine lexical and grammatical data for better performance.
Identify relations between language subgroups and isolates.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neural networks automate language classification
Combines lexical and grammatical data
Identifies relationships among language subgroups
Frederic Blum
Department of Linguistic and Cultural Evolution, Max Planck Institute for Evolutionary Anthropology, 04103 Leipzig, Germany; Chair of Multilingual Computational Linguistics, University of Passau, 94032 Passau, Germany
Steffen Herbold
University of Passau
Johann-Mattis List
Professor of Multilingual Computational Linguistics, University of Passau
general linguistics, historical linguistics, Chinese linguistics, computational linguistics