Chitrarth: Bridging Vision and Language for a Billion People

📅 2025-02-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal foundation models heavily rely on English or high-resource European language data, resulting in poor generalization to low- and medium-resource languages—particularly in India’s linguistically diverse settings. To address this, we introduce Chitrarth, the first inclusive vision-language model tailored to ten major Indian languages. Our approach features: (1) a unified multilingual vision-language architecture integrating a state-of-the-art multilingual large language model with a visual encoder; (2) BharatBench—the first open, India-specific multimodal evaluation benchmark; and (3) training on high-quality, multilingual image–text alignment data. Experiments demonstrate that Chitrarth achieves state-of-the-art performance across multiple low-resource language benchmarks while maintaining competitive accuracy on English tasks. Notably, it significantly advances fairness and effectiveness in multilingual visual understanding, bridging critical gaps in cross-lingual multimodal representation learning.
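
The architecture the summary describes follows the common recipe of coupling a pretrained vision encoder to a multilingual LLM through a learned projection. Below is a minimal PyTorch sketch of that pattern; the module names, dimensions, and the HuggingFace-style inputs_embeds interface are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a LLaVA-style multilingual VLM, assuming a vision encoder
# whose patch features are projected into the embedding space of a
# multilingual decoder-only LLM. All names and dimensions are illustrative.
import torch
import torch.nn as nn


class MultilingualVLM(nn.Module):
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a ViT returning patch features
        # Two-layer MLP projector mapping image features into the LLM token space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm  # multilingual decoder-only LLM

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        patches = self.vision_encoder(pixel_values)   # (B, N, vision_dim)
        image_tokens = self.projector(patches)        # (B, N, llm_dim)
        # Prepend projected image tokens to the text embeddings and let the
        # LLM attend over both (HuggingFace-style inputs_embeds interface).
        fused = torch.cat([image_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=fused)
```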

📝 Abstract
Recent multimodal foundation models are primarily trained on English or high-resource European language data, which hinders their applicability to other medium- and low-resource languages. To address this limitation, we introduce Chitrarth (Chitra: Image; Artha: Meaning), an inclusive Vision-Language Model (VLM) specifically targeting the rich linguistic diversity and visual reasoning across 10 prominent Indian languages. Our model effectively integrates a state-of-the-art (SOTA) multilingual Large Language Model (LLM) with a vision module, trained primarily on multilingual image-text data. Furthermore, we introduce BharatBench, a comprehensive framework for evaluating VLMs across various Indian languages, ultimately contributing to more diverse and effective AI systems. Our model achieves SOTA results on benchmarks for low-resource languages while retaining its performance in English. Through our research, we aim to set new benchmarks in multilingual-multimodal capabilities, offering substantial improvements over existing models and establishing a foundation for future advancements in this arena.
Problem

Research questions and friction points this paper is trying to address.

Multilingual Vision-Language Model
Low-resource Indian languages
Improving diversity in multimodal AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual vision-language model for Indian languages
Integrates a multilingual LLM with a vision encoder
Introduces the BharatBench evaluation framework (see the sketch below)
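
BharatBench itself is only described at a high level on this page. As a rough illustration of what evaluating a VLM across the ten target languages could look like, here is a hypothetical per-language scoring loop; the language codes, dataset schema, and model.generate interface are assumptions, not BharatBench's actual API.

```python
# Hypothetical per-language scoring loop in the spirit of BharatBench.
# The language codes, example schema ({"image", "question", "answer", "lang"}),
# and model.generate signature are assumptions for illustration only.
from collections import defaultdict

LANGUAGES = {"hi", "bn", "ta", "te", "mr", "gu", "kn", "ml", "or", "as"}  # assumed codes


def evaluate(model, dataset):
    correct, total = defaultdict(int), defaultdict(int)
    for example in dataset:
        lang = example["lang"]
        if lang not in LANGUAGES:
            continue  # score only the target Indian languages
        prediction = model.generate(example["image"], example["question"])
        total[lang] += 1
        if prediction.strip().lower() == example["answer"].strip().lower():
            correct[lang] += 1
    # Per-language exact-match accuracy.
    return {lang: correct[lang] / total[lang] for lang in total}
```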
👥 Authors
Shaharukh Khan · Unknown affiliation
Ayush K Tarun · Krutrim AI, Bangalore, India
Abhinav Ravi · Krutrim AI, Bangalore, India
Ali Faraz · Data Scientist, Krutrim
Akshat Patidar · Krutrim AI, Bangalore, India
Praveen Pokala · Krutrim AI, Bangalore, India
Anagha Bhangare · IIT Bombay | Applied AI, OLA Krutrim
Raja Kolla · Krutrim AI, Bangalore, India
Chandra Khatri · Ola Krutrim AI
Shubham Agarwal · Krutrim AI, Bangalore, India