BLUCK: A Benchmark Dataset for Bengali Linguistic Understanding and Cultural Knowledge

📅 2025-05-27

🤖 AI Summary

This study addresses the lack of culturally and linguistically grounded evaluation benchmarks for Bengali, a mid-resource language, by introducing BLUCK, a multiple-choice question (MCQ) benchmark of 2,366 items across 23 categories covering Bangladesh's culture and history and Bengali linguistics. The questions are curated from compiled collections of college-level and job-level examinations, and nine LLMs (six proprietary, three open-source) are benchmarked on them with a fine-grained, per-category analysis. Key contributions include: (1) the first MCQ-based evaluation benchmark centered on native Bengali culture, history, and linguistics; (2) a cross-model comparison spanning all 23 categories; and (3) empirical evidence that the models perform reasonably well overall but struggle in some areas of Bengali phonetics, which provides a quantitative baseline for culture-aware evaluation of Bengali.

📝 Abstract

In this work, we introduce BLUCK, a new dataset designed to measure the performance of Large Language Models (LLMs) in Bengali linguistic understanding and cultural knowledge. Our dataset comprises 2,366 multiple-choice questions (MCQs) carefully curated from compiled collections of several college-level and job-level examinations and spans 23 categories covering knowledge of Bangladesh's culture and history and Bengali linguistics. We benchmarked 6 proprietary and 3 open-source LLMs on BLUCK, including GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro, Llama-3.3-70B-Instruct, and DeepSeekV3. Our results show that while these models perform reasonably well overall, they struggle in some areas of Bengali phonetics. Although current LLMs' performance on Bengali cultural and linguistic contexts is still not comparable to their performance on mainstream languages like English, our results indicate Bengali's status as a mid-resource language. Importantly, BLUCK is also the first MCQ-based evaluation benchmark centered on native Bengali culture, history, and linguistics.
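
To make the per-category evaluation concrete, the following is a minimal sketch of how MCQ accuracy could be computed overall and per category. It is not the authors' evaluation code: the record fields (`id`, `category`, `answer`), the option labels, and the two category names in the toy data are assumptions made for illustration, and the abstract does not specify the dataset's actual schema or scoring procedure.

```python
from collections import defaultdict


def per_category_accuracy(items, predictions):
    """Return {category: accuracy} for a list of MCQ items.

    items: list of dicts with hypothetical keys 'id', 'category', 'answer'.
    predictions: dict mapping item id -> predicted option label.
    A missing prediction counts as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        category = item["category"]
        total[category] += 1
        if predictions.get(item["id"]) == item["answer"]:
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


def overall_accuracy(items, predictions):
    """Return the fraction of items answered correctly (0.0 for no items)."""
    if not items:
        return 0.0
    hits = sum(predictions.get(item["id"]) == item["answer"] for item in items)
    return hits / len(items)


# Toy data only; category names and answers are illustrative placeholders.
toy_items = [
    {"id": 1, "category": "phonetics", "answer": "A"},
    {"id": 2, "category": "phonetics", "answer": "C"},
    {"id": 3, "category": "history", "answer": "B"},
]
toy_predictions = {1: "A", 2: "B", 3: "B"}

category_scores = per_category_accuracy(toy_items, toy_predictions)
overall_score = overall_accuracy(toy_items, toy_predictions)
```

Reporting accuracy per category alongside the overall figure is what lets a weak area such as phonetics stand out even when the aggregate score looks reasonable.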

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' performance in Bengali linguistic understanding

Assessing LLMs' knowledge of Bengali culture and history

Identifying gaps in Bengali phonetics understanding by LLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces BLUCK dataset for Bengali LLM evaluation

Benchmarks 9 LLMs on cultural and linguistic tasks

First MCQ-based benchmark for native Bengali context
