BenHalluEval: A Multi-Task Hallucination Evaluation Framework for Large Language Models on Bengali

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the lack of systematic evaluation of hallucinations in large language models (LLMs) for Bengali. It proposes BenHalluEval, the first fine-grained, multi-task hallucination benchmark for Bengali, comprising 12,000 hallucinated candidates drawn from three existing Bengali datasets. The benchmark spans four tasks (generative question answering, Bengali-English code-mixed QA, summarization, and reasoning) and covers 12 task-specific hallucination types. The work introduces a dual-track evaluation protocol and a calibration metric, BenHalluScore, which jointly accounts for the false-positive rate on ground-truth instances and the hallucination detection rate on hallucinated candidates. Experiments on seven LLMs yield BenHalluScore values ranging from 7.72% to 55.42% across models and tasks, revealing substantial disparities in hallucination calibration. Chain-of-thought prompting shifts response distributions but does not consistently improve discrimination, underscoring the limitations of single-track evaluation and prompt-only mitigation strategies.

📝 Abstract

Despite Bengali being the sixth most spoken language in the world, no prior work has systematically evaluated hallucination in large language models (LLMs) for Bengali. We introduce BenHalluEval, a fine-grained hallucination evaluation framework for Bengali covering four tasks: Generative Question Answering (GQA), Bangla-English Code-Mixed QA, Summarization, and Reasoning. We construct 12,000 hallucinated candidates using GPT-5.4 across twelve task-specific hallucination types, drawn from three existing Bengali datasets, and evaluate seven LLMs spanning reasoning-oriented, multilingual, and Bengali-centric categories under a dual-track protocol that independently measures false-positive rate on ground-truth instances (Track A) and hallucination detection rate on hallucinated candidates (Track B). To jointly penalise both failure modes and prevent inflated scores from uniform response bias, we propose BenHalluScore, a dual-track calibration metric that ranges from 7.72% to 55.42% across models and tasks, revealing substantial variation in hallucination calibration. Chain-of-thought prompting, applied as a mitigation strategy, shifts response distributions without consistently improving hallucination discrimination. BenHalluEval establishes the first dedicated hallucination benchmark for Bengali and highlights the inadequacy of single-track and prompting-only evaluation approaches for low-resource language settings. The dataset and code are available at https://anonymous.4open.science/r/BanglaHalluEval-EB77.
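
To make the dual-track idea in the abstract concrete, the following is a minimal Python sketch of how the two per-track rates could be computed from binary model judgements. The names `TrackRates`, `dual_track_rates`, and `illustrative_calibration` are hypothetical, and the combination rule in `illustrative_calibration` (detection rate minus false-positive rate, floored at zero) is an assumption chosen only for illustration. The abstract does not state the BenHalluScore formula, so this should not be read as the authors' metric.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TrackRates:
    false_positive_rate: float  # Track A: ground-truth instances flagged as hallucinated
    detection_rate: float       # Track B: hallucinated candidates flagged as hallucinated


def dual_track_rates(track_a_flags: Sequence[bool],
                     track_b_flags: Sequence[bool]) -> TrackRates:
    """Compute both track rates.

    Each flag is True when the model labelled the instance as hallucinated.
    track_a_flags holds judgements on ground-truth instances (Track A),
    track_b_flags holds judgements on hallucinated candidates (Track B).
    """
    if not track_a_flags or not track_b_flags:
        raise ValueError("both tracks need at least one judgement")
    fpr = sum(track_a_flags) / len(track_a_flags)
    hdr = sum(track_b_flags) / len(track_b_flags)
    return TrackRates(false_positive_rate=fpr, detection_rate=hdr)


def illustrative_calibration(rates: TrackRates) -> float:
    """ASSUMPTION: a stand-in combination, not the paper's BenHalluScore.

    Subtracting the false-positive rate means a model that flags everything
    (or nothing) gains no credit from a uniform response bias.
    """
    return max(0.0, rates.detection_rate - rates.false_positive_rate)


# A model that answers "hallucinated" for every input looks perfect on
# Track B alone, but the joint view exposes the bias.
always_yes = dual_track_rates([True] * 4, [True] * 4)

# A model that discriminates reasonably well on both tracks.
discriminating = dual_track_rates([False, False, False, True],
                                  [True, True, True, False])
```

Under this assumed combination, the always-flagging model has a Track B detection rate of 1.0 yet a joint value of 0.0, which illustrates why a single-track evaluation can report inflated scores.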

Problem

Research questions and friction points this paper is trying to address.

hallucination

Bengali

large language models

evaluation framework

low-resource languages

Innovation

Methods, ideas, or system contributions that make the work stand out.

hallucination evaluation

Bengali LLMs

multi-task benchmark

dual-track calibration

low-resource languages
