BenHalluEval: A Multi-Task Hallucination Evaluation Framework for Large Language Models on Bengali

📅 2026-05-29

🤖 AI Summary
This study addresses the lack of systematic hallucination evaluation for large language models (LLMs) in Bengali. It introduces BenHalluEval, the first fine-grained, multi-task hallucination benchmark for Bengali: 12,000 hallucinated candidates drawn from three existing Bengali datasets, spanning four tasks (generative question answering, Bengali-English code-mixed QA, summarization, and reasoning) and twelve task-specific hallucination types. A dual-track protocol separately measures the false-positive rate on ground-truth instances (Track A) and the hallucination detection rate on hallucinated candidates (Track B), and a calibration metric, BenHalluScore, penalises both failure modes jointly. For the seven LLMs evaluated, BenHalluScore ranges from 7.72% to 55.42% across models and tasks, indicating large differences in hallucination calibration. Chain-of-thought prompting shifts response distributions but does not consistently improve hallucination discrimination, which points to the limits of single-track evaluation and prompting-only mitigation.
📝 Abstract
Despite Bengali being the sixth most spoken language in the world, no prior work has systematically evaluated hallucination in large language models (LLMs) for Bengali. We introduce BenHalluEval, a fine-grained hallucination evaluation framework for Bengali covering four tasks: Generative Question Answering (GQA), Bangla-English Code-Mixed QA, Summarization, and Reasoning. We construct 12,000 hallucinated candidates using GPT-5.4 across twelve task-specific hallucination types, drawn from three existing Bengali datasets, and evaluate seven LLMs spanning reasoning-oriented, multilingual, and Bengali-centric categories under a dual-track protocol that independently measures false-positive rate on ground-truth instances (Track A) and hallucination detection rate on hallucinated candidates (Track B). To jointly penalise both failure modes and prevent inflated scores from uniform response bias, we propose BenHalluScore, a dual-track calibration metric that ranges from 7.72% to 55.42% across models and tasks, revealing substantial variation in hallucination calibration. Chain-of-thought prompting, applied as a mitigation strategy, shifts response distributions without consistently improving hallucination discrimination. BenHalluEval establishes the first dedicated hallucination benchmark for Bengali and highlights the inadequacy of single-track and prompting-only evaluation approaches for low-resource language settings. The dataset and code are available at https://anonymous.4open.science/r/BanglaHalluEval-EB77.
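The dual-track idea in the abstract can be illustrated with a minimal Python sketch. The two per-track rates follow the abstract's description directly. The abstract does not state the formula for BenHalluScore, so the combined score below (`illustrative_joint_score`, detection rate multiplied by one minus the false-positive rate) is an assumed stand-in chosen only to show why a joint metric resists uniform response bias; it is not the paper's definition. All function names are hypothetical.

```python
from typing import Sequence, Tuple


def rate_flagged(predictions: Sequence[bool]) -> float:
    """Fraction of instances the model labelled as hallucinated (True)."""
    if not predictions:
        raise ValueError("need at least one prediction")
    return sum(predictions) / len(predictions)


def dual_track_rates(
    track_a: Sequence[bool], track_b: Sequence[bool]
) -> Tuple[float, float]:
    """Return (false-positive rate, hallucination detection rate).

    track_a: model verdicts on ground-truth instances (True = flagged).
    track_b: model verdicts on hallucinated candidates (True = flagged).
    """
    fpr = rate_flagged(track_a)  # Track A: correct answers wrongly flagged
    hdr = rate_flagged(track_b)  # Track B: hallucinations correctly flagged
    return fpr, hdr


def illustrative_joint_score(fpr: float, hdr: float) -> float:
    """ASSUMPTION: a stand-in combination, not the paper's BenHalluScore.

    It is high only when detection is high AND false positives are low.
    """
    return hdr * (1.0 - fpr)


# A model that flags everything looks perfect on Track B alone,
# but the joint score collapses to zero.
always_yes = dual_track_rates([True] * 4, [True] * 4)
# A model that flags nothing looks perfect on Track A alone.
always_no = dual_track_rates([False] * 4, [False] * 4)
# A model that actually discriminates.
mixed = dual_track_rates([False, False, False, True], [True, True, True, False])

print(illustrative_joint_score(*always_yes))
print(illustrative_joint_score(*always_no))
print(illustrative_joint_score(*mixed))
```

Under this assumed combination, both degenerate strategies (always flagging and never flagging) score zero, which is the behaviour the abstract attributes to a dual-track metric; a single-track score would rate one of them as perfect.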
Problem

Research questions and friction points this paper is trying to address.

hallucination
Bengali
large language models
evaluation framework
low-resource languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

hallucination evaluation
Bengali LLMs
multi-task benchmark
dual-track calibration
low-resource languages