Analysis of Indic Language Capabilities in LLMs

📅 2025-01-23
🤖 AI Summary
This study addresses the lack of comprehensive evaluation of large language models (LLMs) for Indian languages. We systematically assess 28 LLMs across 22 Indian languages on understanding and generation tasks to identify priority languages suitable for safety benchmarking. Our methodology introduces the first multi-dimensional, cross-model and cross-dataset quantitative analysis—covering linguistic performance, training data provenance, licensing terms, access modalities, and developer affiliations. Results reveal Hindi as the most widely supported language, yet substantial performance disparities exist across others; while the top five languages roughly align with native speaker population size, this correlation breaks down thereafter—exposing a critical misalignment between language coverage and actual user demographics. The work proposes the first holistic evaluation framework tailored to India’s multilingual ecosystem, offering empirical grounding and methodological rigor for equitable language assessment and resource allocation in LLM development.

📝 Abstract
This report evaluates how well text-in, text-out Large Language Models (LLMs) understand and generate Indic languages. The evaluation is used to identify and prioritize Indic languages suited for inclusion in safety benchmarks. We conduct the study by reviewing existing evaluation studies and datasets, along with a set of twenty-eight LLMs that support Indic languages. We analyze the LLMs on the basis of their training data, the licenses for the model and data, the type of access, and the model developers. We also compare Indic language performance across evaluation datasets and find significant performance disparities across Indic languages. Hindi is the most widely represented language in models. While model performance roughly correlates with the number of speakers for the top five languages, the picture for the remaining languages varies.
Problem

Research questions and friction points this paper is trying to address.

Multilingual Capability
Indian Languages
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Indian Languages
Large Language Models
Performance Analysis
Aatman Vaidya
Tattle Civic Tech, India
Tarunima Prabhakar
Tattle Civic Tech, India
Denny George
Tattle Civic Tech, India
Swair Shah
Amazon.com
Machine Learning