🤖 AI Summary
This study addresses the poor performance of large language models (LLMs) on low-resource dialects and the absence of effective evaluation frameworks. Focusing on nine Bengali dialects, the authors propose a two-stage evaluation pipeline: first, they employ retrieval-augmented generation (RAG) for dialect translation and assess translation fidelity using an LLM-as-a-judge approach complemented by human fallback; second, they evaluate question-answering performance via a multi-rater consensus-based RLAIF process. Key contributions include a novel translation quality assessment method for non-standardized dialects, a rigorously constructed dialect bias benchmark dataset, and a Critical Bias Sensitivity metric tailored for safety-critical applications. Experiments reveal that model performance on highly divergent dialects—such as Chittagonian (5.44/10)—lags significantly behind that on dominant variants like Tangail (7.68/10), and increasing model scale does not consistently mitigate such biases.
📝 Abstract
Large language models (LLMs) frequently exhibit performance biases against regional dialects of low-resource languages, yet frameworks to quantify these disparities remain scarce. We propose a two-phase framework to evaluate dialectal bias in LLM question-answering across nine Bengali dialects. First, we translate standard Bengali questions into dialectal variants and gold-label them using a retrieval-augmented generation (RAG) pipeline, preparing 4,000 question sets. Since traditional translation quality metrics fail on unstandardized dialects, we evaluate fidelity with an LLM-as-a-judge approach, which correlation with human judgments confirms outperforms those metrics. Second, we benchmark 19 LLMs on these gold-labeled sets, running 68,395 RLAIF evaluations validated through multi-judge agreement and human fallback. Our findings reveal severe performance drops linked to linguistic divergence: responses to the highly divergent Chittagong dialect score 5.44/10, compared with 7.68/10 for Tangail. Furthermore, increased model scale does not consistently mitigate this bias. We contribute a validated translation quality evaluation method, a rigorous benchmark dataset, and a Critical Bias Sensitivity (CBS) metric for safety-critical applications.
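The multi-judge agreement with human fallback described in the abstract can be sketched roughly as follows. This is a minimal illustration only: the function name, the 0–10 scale aggregation, and the agreement threshold are assumptions for the sketch, not details taken from the paper.

```python
from statistics import mean, pstdev

def consensus_score(judge_scores, agreement_threshold=1.0):
    """Aggregate one response's scores (0-10) from several LLM judges.

    Returns (score, needs_human): when the judges' spread exceeds the
    threshold, the item is flagged for human fallback rather than being
    auto-scored. Threshold and names are illustrative.
    """
    spread = pstdev(judge_scores)  # population std. dev. across judges
    if spread > agreement_threshold:
        return None, True          # disagreement -> route to human rater
    return round(mean(judge_scores), 2), False

# Judges broadly agree -> consensus score accepted automatically
print(consensus_score([7.5, 7.0, 8.0]))   # (7.5, False)
# Judges disagree -> escalate to a human rater
print(consensus_score([3.0, 8.0, 6.5]))   # (None, True)
```

The design choice here (variance-based agreement with escalation) is one common way to validate LLM-as-a-judge scores at scale while keeping humans in the loop for contested items.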