URIEL+: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base

📅 2024-09-27
🏛️ International Conference on Computational Linguistics
🤖 AI Summary
URIEL+ addresses three limitations of the URIEL typological knowledge base: the breadth of its language coverage, the completeness of its typological features, and the linguistic plausibility of its distance computations. Methodologically, it expands typological feature coverage for 2,898 languages by integrating multi-source data from WALS, Glottolog, and Ethnologue. It also introduces a user-configurable distance framework that combines weighted cosine similarity with feature imputation and operates on three kinds of vector representations: geographic, genealogical, and typological. The resulting cross-lingual distances align better with findings from linguistic distance studies, and downstream task performance is competitive with existing baselines. URIEL+ thus provides robust, customizable, and linguistically grounded cross-lingual distance estimates for multilingual modeling and language evolution research.
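To make the idea of a distance that combines feature imputation with a weighted cosine measure concrete, here is a minimal, self-contained Python sketch. It is not the URIEL+ or lang2vec implementation: the function names (`impute_missing`, `weighted_cosine_distance`), the mean-imputation strategy, the toy language labels, and the feature values are all assumptions made for illustration.

```python
import math


def impute_missing(vectors):
    """Fill None entries with the per-feature mean of the observed values.

    Mean imputation is an assumed stand-in here; it is not claimed to be
    the imputation method used by URIEL+.
    """
    n_features = len(next(iter(vectors.values())))
    means = []
    for j in range(n_features):
        observed = [v[j] for v in vectors.values() if v[j] is not None]
        # Fall back to 0.0 when no language has a value for this feature.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return {
        lang: [means[j] if x is None else x for j, x in enumerate(v)]
        for lang, v in vectors.items()
    }


def weighted_cosine_distance(a, b, weights=None):
    """Return 1 - weighted cosine similarity of two equal-length vectors."""
    if weights is None:
        weights = [1.0] * len(a)
    dot = sum(w * x * y for w, x, y in zip(weights, a, b))
    norm_a = math.sqrt(sum(w * x * x for w, x in zip(weights, a)))
    norm_b = math.sqrt(sum(w * y * y for w, y in zip(weights, b)))
    if norm_a == 0 or norm_b == 0:
        # No usable signal in at least one vector: treat as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical binary typological features; None marks a missing annotation.
toy_features = {
    "lang_a": [1, 0, 1, None],
    "lang_b": [1, 0, None, 1],
    "lang_c": [0, 1, 0, 0],
}

filled = impute_missing(toy_features)
dist_ab = weighted_cosine_distance(filled["lang_a"], filled["lang_b"])
dist_ac = weighted_cosine_distance(filled["lang_a"], filled["lang_c"])

# A user-chosen weighting that ignores the two sparsely annotated features.
dist_ab_core = weighted_cosine_distance(
    filled["lang_a"], filled["lang_b"], weights=[1.0, 1.0, 0.0, 0.0]
)
```

In this toy setup, `lang_a` and `lang_b` come out closer to each other than `lang_a` and `lang_c`, and zeroing the weights of the imputed features removes their influence on the result. The weight vector is the kind of knob a user-configurable framework could expose, though the actual options in URIEL+ may differ.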

📝 Abstract
URIEL is a knowledge base offering geographical, phylogenetic, and typological vector representations for 7970 languages. It includes distance measures between these vectors for 4005 languages, which are accessible via the lang2vec tool. Despite being frequently cited, URIEL is limited in terms of linguistic inclusion and overall usability. To tackle these challenges, we introduce URIEL+, an enhanced version of URIEL and lang2vec that addresses these limitations. In addition to expanding typological feature coverage for 2898 languages, URIEL+ improves the user experience with robust, customizable distance calculations to better suit the needs of users. These upgrades also offer competitive performance on downstream tasks and provide distances that better align with linguistic distance studies.
Problem

Research questions and friction points this paper is trying to address.

Enhances linguistic inclusion in URIEL
Improves usability of lang2vec tool
Expands typological feature coverage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Expanded typological feature coverage
Customizable distance calculations
Improved linguistic distance alignment
Authors
Aditya Khan
University of Toronto, Canada
Mason Shipton
Ontario Tech University, Canada
David Anugraha
Stanford University
Machine Learning, Natural Language Processing, Multimodality, Artificial Intelligence
Kaiyao Duan
University of Toronto, Canada
Phuong H. Hoang
Oak Ridge National Lab
Grid Modeling, Evidence Theory, Machine Learning, Optimization and Control, Power Systems
Eric Khiu
University of Michigan, USA
A. Seza Doğruöz
LT3, IDLab, Universiteit Gent, Belgium
E. Lee
Ontario Tech University, Canada