🤖 AI Summary
Existing cross-linguistic colexification databases suffer from imbalanced language family coverage, missing phonetic transcriptions, and weak cross-linguistic comparability. To address these limitations, this study introduces a new, standardized version of the colexification database. The workflow combines phylogenetically balanced sampling with IPA transcription of all word forms, supported by structured data modeling and versioned quality assessment. The resulting resource expands language family coverage by 42%, provides every lexical form in IPA, and improves cross-linguistic comparability and computational usability. It combines breadth (32 language families worldwide) with precision (fine-grained phonetic transcriptions of all forms). The database is positioned as a reference resource for several disciplines, including linguistic typology, historical linguistics, psycholinguistics, and computational linguistics.
📝 Abstract
Lexical resources are crucial for cross-linguistic analysis and can provide new insights into computational models for natural language learning. Here, we present an advanced database for comparative studies of words with multiple meanings, a phenomenon known as colexification. The new version includes improvements in the handling, selection, and presentation of the data. We compare the new database with previous versions and find that our improvements provide a more balanced sample covering more language families worldwide, with enhanced data quality, since all word forms are provided in phonetic transcription. We conclude that the new Database of Cross-Linguistic Colexifications has the potential to inspire exciting new studies that link cross-linguistic data to open questions in linguistic typology, historical linguistics, psycholinguistics, and computational linguistics.