What Language is This? Ask Your Tokenizer

📅 2026-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current language identification systems suffer significant performance degradation on low-resource languages and in closely related dialect scenarios. This work proposes UniLID, a method that treats tokenization as a language-specific process and leverages the Unigram language model to learn language-conditional unigram distributions over a shared tokenizer vocabulary. By doing so, UniLID supports incremental addition of new languages without retraining existing models. The approach substantially improves both sample efficiency and identification accuracy: it matches the performance of state-of-the-art tools on standard benchmarks and achieves over 70% accuracy on low-resource languages with only five labeled examples per language. UniLID also delivers notable gains on fine-grained dialect identification, highlighting its effectiveness in these challenging linguistic settings.

📝 Abstract
Language Identification (LID) is an important component of many multilingual natural language processing pipelines, where it facilitates corpus curation, training data analysis, and cross-lingual evaluation of large language models. Despite near-perfect performance on high-resource languages, existing systems remain brittle in low-resource and closely related language settings. We introduce UniLID, a simple and efficient LID method based on the UnigramLM tokenization algorithm, leveraging its probabilistic framing, parameter estimation technique and inference strategy. In short, we learn language-conditional unigram distributions over a shared tokenizer vocabulary but treat segmentation as a language-specific phenomenon. Our formulation is data- and compute-efficient, supports incremental addition of new languages without retraining existing models, and can naturally be integrated into existing language model tokenization pipelines. Empirical evaluations against widely used baselines, including fastText, GlotLID, and CLD3, show that UniLID achieves competitive performance on standard benchmarks, substantially improves sample efficiency in low-resource settings - surpassing 70% accuracy with as few as five labeled samples per language - and delivers large gains on fine-grained dialect identification.
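To make the abstract's formulation concrete, here is a minimal, hypothetical sketch of the idea: each language gets its own unigram distribution over a shared vocabulary, and a text is assigned to the language under which its best-scoring segmentation has the highest probability. The count-based estimation below is a crude stand-in for UnigramLM's EM procedure, and all function names, the toy vocabulary, and the training strings are invented for illustration; this is not the authors' implementation.

```python
import math
from collections import Counter

def segment_score(text, logprob, max_len=6):
    """Best log-probability of `text` over all segmentations (Viterbi),
    given a unigram model `logprob` mapping piece -> log probability.
    Unknown single characters get a flat penalty so any string scores."""
    best = [-math.inf] * (len(text) + 1)
    best[0] = 0.0
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_len), i):
            lp = logprob.get(text[j:i])
            if lp is None:
                if i - j != 1:
                    continue
                lp = math.log(1e-8)  # out-of-vocabulary character fallback
            if best[j] + lp > best[i]:
                best[i] = best[j] + lp
    return best[-1]

def train_language(samples, vocab, max_len=6):
    """Language-conditional unigram distribution over the shared vocab,
    estimated by substring counting with add-one smoothing
    (a crude stand-in for UnigramLM's EM-based estimation)."""
    counts = Counter()
    for text in samples:
        for i in range(len(text)):
            for k in range(1, max_len + 1):
                if text[i:i + k] in vocab:
                    counts[text[i:i + k]] += 1
    total = sum(counts.values()) + len(vocab)
    return {p: math.log((counts[p] + 1) / total) for p in vocab}

def identify(text, models):
    """Assign `text` to the language whose model scores it highest."""
    return max(models, key=lambda lang: segment_score(text, models[lang]))

# Toy shared vocabulary and two 'languages' (all data invented):
vocab = {"the", "th", "he", "cat", "ca", "at", "le", "chat", "ch", "la",
         "t", "h", "e", "c", "a", "l"}
models = {
    "en": train_language(["the cat", "that hat"], vocab),
    "fr": train_language(["le chat", "la chatte"], vocab),
}
# Adding a new language later means training one more distribution;
# existing models are untouched, mirroring the incremental-addition claim.
```

On this toy data, `identify("the cat", models)` returns `"en"` and `identify("le chat", models)` returns `"fr"`: the English model assigns high probability to the whole pieces `the` and `cat`, while the French model must fall back to smoothed probabilities for them, and vice versa.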
Problem

Research questions and friction points this paper is trying to address.

Language identification
Low-resource languages
Closely related languages
Dialect identification
Innovation

Methods, ideas, or system contributions that make the work stand out.

UnigramLM
Language identification
Low-resource settings
Tokenizer integration
Incremental learning