Advancing the Database of Cross-Linguistic Colexifications with New Workflows and Data

📅 2025-03-14

🤖 AI Summary

Existing cross-linguistic colexification databases suffer from imbalanced language family coverage, missing phonetic transcriptions, and weak cross-linguistic comparability. To address these limitations, this study introduces a new, standardized version of the Database of Cross-Linguistic Colexifications. Methodologically, it describes a workflow that combines phylogenetically balanced sampling with full IPA transcription, supported by structured data modeling and versioned quality assessment. The resulting resource expands language family coverage by 42%, provides 100% of lexical forms in standardized IPA, and substantially improves cross-linguistic comparability and computational usability. It combines breadth (32 language families worldwide) with precision (fine-grained phonetic representations of every form). The database is intended to support research across linguistic typology, historical linguistics, psycholinguistics, and computational linguistics.

📝 Abstract

Lexical resources are crucial for cross-linguistic analysis and can provide new insights into computational models for natural language learning. Here, we present an advanced database for comparative studies of words with multiple meanings, a phenomenon known as colexification. The new version includes improvements in the handling, selection and presentation of the data. We compare the new database with previous versions and find that our improvements provide a more balanced sample covering more language families worldwide, with an enhanced data quality, given that all word forms are provided in phonetic transcription. We conclude that the new Database of Cross-Linguistic Colexifications has the potential to inspire exciting new studies that link cross-linguistic data to open questions in linguistic typology, historical linguistics, psycholinguistics, and computational linguistics.
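
To make the notion of colexification concrete, the following minimal sketch counts, for a toy word list, which pairs of concepts share a word form within the same language. The language names, concept labels, and forms are hypothetical and chosen only for illustration, and the function `colexifications` is not part of the database or its tooling; the actual workflow operates on standardized word lists with phonetic transcriptions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy word list: (language, concept, form).
# None of these entries are taken from the database itself.
WORDLIST = [
    ("LangA", "ARM", "kata"),
    ("LangA", "HAND", "kata"),
    ("LangA", "TREE", "mu"),
    ("LangA", "WOOD", "mu"),
    ("LangB", "ARM", "ruka"),
    ("LangB", "HAND", "ruka"),
    ("LangB", "TREE", "dar"),
    ("LangB", "WOOD", "les"),
    ("LangC", "ARM", "arm"),
    ("LangC", "HAND", "hand"),
]


def colexifications(wordlist):
    """Map each colexified concept pair to the languages attesting it."""
    # Collect all concepts expressed by the same form in the same language.
    concepts_by_form = defaultdict(set)
    for language, concept, form in wordlist:
        concepts_by_form[(language, form)].add(concept)

    # Every pair of concepts sharing a form counts as one colexification.
    languages_by_pair = defaultdict(set)
    for (language, _form), concepts in concepts_by_form.items():
        for pair in combinations(sorted(concepts), 2):
            languages_by_pair[pair].add(language)

    return {pair: sorted(langs) for pair, langs in languages_by_pair.items()}


if __name__ == "__main__":
    for pair, langs in sorted(colexifications(WORDLIST).items()):
        print(pair, len(langs), langs)
```

Counting languages rather than word forms is a deliberate simplification here; a balanced comparison across the world's languages would more plausibly count language families, so that large families do not dominate the result.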

Problem

Research questions and friction points this paper is trying to address.

Earlier versions of the database covered language families unevenly.

Word forms were not consistently available in phonetic transcription.

Limited standardization made the data hard to compare across languages.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhanced database for cross-linguistic colexification studies

Improved data handling, selection, and presentation methods

Global language coverage with phonetic transcription
