Evaluating LLMs' Multilingual Capabilities for Bengali: Benchmark Creation and Performance Analysis

📅 2025-07-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
The absence of standardized multilingual evaluation benchmarks for Bangla hinders systematic assessment of large language models (LLMs) in this low-resource language. Method: We introduce the first open-source, multidimensional Bangla LLM benchmark—comprising eight translation tasks—and systematically evaluate ten mainstream open-source LLMs. We propose a low-resource–adapted evaluation framework integrating subword tokenization statistics, fine-grained error analysis, and cross-lingual performance comparison. Contributions/Results: (1) All models exhibit substantially lower accuracy on Bangla than on English, with particularly pronounced degradation in smaller-parameter models and the Mistral series; (2) Subword tokenization efficiency strongly and negatively correlates with task accuracy; (3) Architectures such as DeepSeek demonstrate superior cross-lingual robustness and stability. To foster community advancement, we publicly release all benchmark data, evaluation code, and results—aiming to accelerate NLP benchmark development and model optimization for low-resource languages.

Technology Category

Application Category

📝 Abstract
Bengali is an underrepresented language in NLP research. However, it remains a challenge due to its unique linguistic structure and computational constraints. In this work, we systematically investigate the challenges that hinder Bengali NLP performance by focusing on the absence of standardized evaluation benchmarks. We then evaluated 10 recent open source Large Language Models (LLMs) in 8 of the translated datasets and performed a comprehensive error analysis to pinpoint their primary failure modes. Our findings reveal consistent performance gaps for Bengali compared to English, particularly for smaller models and specific model families like Mistral. We also identified promising robustness in certain architectures, such as DeepSeek, that maintain more stable performance across languages. Our analysis reveals an inverse relationship between tokenization efficiency and LLM accuracy where models tend to perform worse when inputs are excessively tokenized, whereas more efficient & concise tokenization results in improved performance. These findings highlight critical areas where current models fall short and underscore the need for improved dataset quality and evaluation methodologies tailored to multilingual contexts. This work will catalyze further research on NLP for underrepresented languages, helping to democratize access to advanced language technologies worldwide. The code and dataset used in this research is publicly available at https://github.com/BengaliAI/bn-llm-benchmark.
Problem

Research questions and friction points this paper is trying to address.

Lack of standardized benchmarks for Bengali NLP evaluation
Performance gaps in LLMs for Bengali vs English
Tokenization efficiency inversely affects LLM accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created standardized Bengali evaluation benchmarks for LLMs
Evaluated 10 LLMs on 8 translated Bengali datasets
Analyzed tokenization efficiency impact on model accuracy
🔎 Similar Papers
No similar papers found.
S
Shimanto Bhowmik
Rochester Institute of Technology
Tawsif Tashwar Dipto
Tawsif Tashwar Dipto
Islamic University of Technology
Computer VisionNatural Language ProcessingLow Resource LanguagesAutomatic Speech Recognition
M
Md Sazzad Islam
Stanford University
Sheryl Hsu
Sheryl Hsu
Stanford University
T
Tahsin Reasat
Bengali.AI