ILID: Native Script Language Identification for Indian Languages

📅 2025-07-15
🤖 AI Summary
This paper addresses the core challenges of language identification in India's multilingual setting: script overlap (e.g., Devanagari), high lexical and phonetic similarity among languages, short and noisy texts, and frequent English–Indic code-mixing. The authors (1) construct and publicly release the first high-quality, manually annotated dataset comprising 230K sentences covering English and all 22 officially recognized Indian languages, with 18 languages newly introduced to the research community; and (2) propose a multi-granularity baseline that jointly models character-level and word-level features. According to the abstract, the baseline models perform comparably to state-of-the-art language identification models. The resulting language identification module is intended as a preprocessing component for downstream multilingual NLP tasks.

📝 Abstract
Language identification is a crucial, fundamental step in NLP. It often serves as a pre-processing step for widely used NLP applications such as multilingual machine translation, information retrieval, question answering, and text summarization. The core challenge of language identification lies in distinguishing languages in noisy, short, and code-mixed text. This becomes even harder in the case of the diverse Indian languages, which exhibit lexical and phonetic similarities yet remain distinct languages. Many Indian languages also share the same script, making the task even more challenging. In this paper, we release a dataset of 230K sentences covering English and all 22 official Indian languages, each labeled with its language identifier; the data for most languages are newly created. We also develop and release robust baseline models using state-of-the-art machine learning and deep learning approaches to aid research in this field. Our baseline models are comparable to state-of-the-art models for the language identification task.
Problem

Research questions and friction points this paper is trying to address.

Identifying Indian languages in noisy, short texts
Distinguishing similar Indian languages sharing scripts
Creating datasets and models for language identification
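To illustrate why shared scripts are a friction point, here is a minimal sketch (not taken from the paper) that detects the dominant Unicode script of a sentence and maps it to candidate languages. The `SCRIPT_TO_LANGS` table and both function names are hypothetical and deliberately incomplete; the point is only that a script such as Devanagari narrows the candidates without picking one.

```python
import unicodedata
from collections import Counter

# Illustrative, incomplete mapping (an assumption, not the paper's label set).
SCRIPT_TO_LANGS = {
    "DEVANAGARI": ["Hindi", "Marathi", "Nepali", "Konkani",
                   "Maithili", "Dogri", "Bodo", "Sanskrit"],
    "BENGALI": ["Bengali", "Assamese"],
    "TAMIL": ["Tamil"],
    "LATIN": ["English"],
}

def dominant_script(text):
    """Return the most frequent script among letters and combining marks."""
    counts = Counter()
    for ch in text:
        # Letters (L*) and combining marks (M*) carry script information;
        # digits, punctuation, and whitespace are skipped.
        if unicodedata.category(ch)[0] in ("L", "M"):
            name = unicodedata.name(ch, "")
            if name:
                # The first word of a Unicode character name is the script.
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None

def candidate_languages(text):
    """Languages that remain plausible once only the script is known."""
    return SCRIPT_TO_LANGS.get(dominant_script(text), [])

# Script alone leaves several Devanagari languages as candidates.
print(candidate_languages("नमस्ते दुनिया"))
```

For a Devanagari sentence the function returns several candidates, so a second, content-based classifier is still needed to separate them.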
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dataset of 230K sentences for Indian languages
Baseline models using ML and DL approaches
Baseline performance comparable to state-of-the-art language identification models
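As a rough illustration of combining character-level and word-level features, the sketch below trains a multinomial Naive Bayes classifier on character n-grams plus whole-word tokens. This is an assumption-laden stand-in, not the authors' released baseline: the `features` helper, the `NaiveBayesLID` class, the labels `hin`/`mar`/`eng`, and the six training sentences are all made up for the example.

```python
import math
from collections import Counter, defaultdict

def features(text, max_n=3):
    """Character n-grams (n = 1..max_n) plus word tokens prefixed with 'w:'."""
    padded = f" {text.strip().lower()} "
    feats = [padded[i:i + n]
             for n in range(1, max_n + 1)
             for i in range(len(padded) - n + 1)]
    feats += ["w:" + w for w in padded.split()]
    return feats

class NaiveBayesLID:
    """Multinomial Naive Bayes with add-one smoothing and uniform class priors."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.counts[label].update(features(text))
        self.vocab = {f for c in self.counts.values() for f in c}
        self.totals = {lab: sum(c.values()) for lab, c in self.counts.items()}
        return self

    def predict(self, text):
        v = len(self.vocab)
        feats = features(text)

        def score(label):
            c = self.counts[label]
            total = self.totals[label]
            return sum(math.log((c[f] + 1) / (total + v)) for f in feats)

        return max(self.counts, key=score)

# Toy training data (hypothetical); Hindi and Marathi share the Devanagari script.
train = [
    ("मेरा नाम राम है", "hin"), ("यह किताब अच्छी है", "hin"),
    ("माझे नाव राम आहे", "mar"), ("हे पुस्तक चांगले आहे", "mar"),
    ("my name is ram", "eng"), ("this book is good", "eng"),
]
model = NaiveBayesLID().fit(*zip(*train))
print(model.predict("तुझे नाव काय आहे"))
```

Character n-grams capture sub-word cues such as frequent suffixes, while word tokens capture function words that differ between closely related languages; a real system would need far more data than this toy set.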
Yash Ingle
Sardar Vallabhbhai National Institute of Technology, Surat, India
Pruthwik Mishra
SVNIT, Surat
ML, NLP, CL, MT, Word Problem Solving