🤖 AI Summary
This paper addresses the core challenges of language identification in India’s multilingual setting: script overlap (e.g., Devanagari), high lexical and phonetic similarity among languages, short and noisy texts, and frequent English–Indic code-mixing. Methodologically, we (1) construct and publicly release a manually annotated dataset of 230K sentences covering English and all 22 officially recognized Indian languages, with 18 languages newly introduced to the research community; and (2) propose a multi-granularity deep learning baseline that jointly models character-level and word-level features. Experiments show that the baseline performs comparably to state-of-the-art models on Indian language identification. The resulting language identification module can serve as a preprocessing component for downstream multilingual NLP tasks.
📝 Abstract
Language identification is a fundamental step in NLP. It often serves as a pre-processing step for widely used NLP applications such as multilingual machine translation, information retrieval, question answering, and text summarization. The core challenge of language identification lies in distinguishing languages in noisy, short, and code-mixed text. This becomes even harder for the diverse Indian languages, which are lexically and phonetically similar yet remain distinct languages. Many Indian languages also share the same script, making the task even more challenging. In this paper, we release a dataset of 230K sentences covering English and all 22 official Indian languages, each labeled with its language identifier; the data for most languages are newly created. We also develop and release robust baseline models using state-of-the-art machine learning and deep learning approaches to aid research in this field. Our baseline models perform comparably to state-of-the-art models for the language identification task.
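To illustrate why a shared script makes the task harder, the following minimal sketch (not the paper's method) checks the script of a sentence through Unicode character names and extracts character n-grams, a feature commonly used for language identification. The function names and the example sentences are assumptions made for this illustration only.

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Count letters per script, using the first word of each Unicode character name."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # e.g. "DEVANAGARI LETTER KA" -> "DEVANAGARI"
            counts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    return counts


def dominant_script(text: str) -> str:
    """Return the most frequent script, or "UNKNOWN" if the text has no letters."""
    counts = script_counts(text)
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of length n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Illustrative sentences (assumed examples, not taken from the dataset).
hindi = "यह एक वाक्य है"        # Hindi
marathi = "हे एक वाक्य आहे"      # Marathi
mixed = "mujhe यह movie पसंद है"  # English-Hindi code-mixed

# Script alone cannot separate Hindi from Marathi: both are Devanagari.
print(dominant_script(hindi), dominant_script(marathi))

# A code-mixed sentence contains more than one script.
print(dict(script_counts(mixed)))

# The n-gram profiles of the two Devanagari sentences are not identical,
# which is the kind of signal a character-level classifier can learn from.
print(char_ngrams(hindi, 2) != char_ngrams(marathi, 2))
```

In this sketch, script detection separates English from Devanagari text but cannot distinguish the Hindi sentence from the Marathi one, while their character n-gram counts differ. A real system would train a classifier on such features over a labeled corpus.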