SHALE: A Scalable Benchmark for Fine-grained Hallucination Evaluation in LVLMs

📅 2025-08-13
🤖 AI Summary
Problem: Large vision-language models (LVLMs) suffer from pervasive faithfulness and factuality hallucinations, yet existing evaluation benchmarks are predominantly coarse-grained and either manually curated or built from reused public datasets, which limits scalability and risks data leakage. Method: The paper proposes SHALE, a scalable, fine-grained hallucination benchmark comprising over 30,000 image-instruction pairs. SHALE combines an automated data construction pipeline, a hierarchical hallucination induction framework, and input perturbations to produce controllable and diverse evaluation data. It covers 12 visual perception aspects for faithfulness and 6 knowledge domains for factuality, under both clean and noisy input conditions. Contribution/Results: Experiments on over 20 mainstream LVLMs reveal significant factuality hallucinations and high sensitivity to semantic perturbations.

📝 Abstract
Despite rapid advances, Large Vision-Language Models (LVLMs) still suffer from hallucinations, i.e., generating content inconsistent with input or established world knowledge, which correspond to faithfulness and factuality hallucinations, respectively. Prior studies primarily evaluate faithfulness hallucination at a coarse level (e.g., object-level) and lack fine-grained analysis. Additionally, existing benchmarks rely on costly manual curation or reused public datasets, raising concerns about scalability and data leakage. To address these limitations, we propose an automated data construction pipeline that produces scalable, controllable, and diverse evaluation data. We also design a hierarchical hallucination induction framework with input perturbations to simulate realistic noisy scenarios. Integrating these designs, we construct SHALE, a Scalable HALlucination Evaluation benchmark designed to assess both faithfulness and factuality hallucinations via a fine-grained hallucination categorization scheme. SHALE comprises over 30K image-instruction pairs spanning 12 representative visual perception aspects for faithfulness and 6 knowledge domains for factuality, considering both clean and noisy scenarios. Extensive experiments on over 20 mainstream LVLMs reveal significant factuality hallucinations and high sensitivity to semantic perturbations.
Problem

Research questions and friction points this paper is trying to address.

Evaluates fine-grained hallucinations in LVLMs
Addresses scalability limitations in existing benchmarks
Assesses both faithfulness and factuality hallucination types
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated data construction pipeline for scalable evaluation
Hierarchical hallucination induction framework with input perturbations
Fine-grained categorization scheme for faithfulness and factuality hallucinations
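
As a rough illustration of what a fine-grained evaluation under clean and noisy conditions could look like in practice, the sketch below tallies accuracy per (hallucination type, category, condition) and reports the clean-to-noisy drop. It is not SHALE's scoring code: the record fields (`type`, `category`, `condition`, `prediction`, `answer`), the exact-match metric, the example category names, and the toy records are all assumptions made up for this example.

```python
from collections import defaultdict


def per_category_accuracy(records):
    """Return {(hallucination_type, category, condition): accuracy}.

    Exact-match accuracy is an assumed stand-in metric, not the paper's.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        key = (r["type"], r["category"], r["condition"])
        total[key] += 1
        correct[key] += int(r["prediction"] == r["answer"])
    return {key: correct[key] / total[key] for key in total}


def noise_sensitivity(acc):
    """Accuracy drop from clean to noisy for each (type, category) seen in both."""
    drops = {}
    for (htype, cat, cond), clean_acc in acc.items():
        if cond != "clean":
            continue
        noisy_key = (htype, cat, "noisy")
        if noisy_key in acc:
            drops[(htype, cat)] = clean_acc - acc[noisy_key]
    return drops


# Toy, made-up records; the category names are hypothetical placeholders.
records = [
    {"type": "faithfulness", "category": "color", "condition": "clean",
     "prediction": "yes", "answer": "yes"},
    {"type": "faithfulness", "category": "color", "condition": "clean",
     "prediction": "no", "answer": "no"},
    {"type": "faithfulness", "category": "color", "condition": "noisy",
     "prediction": "yes", "answer": "no"},
    {"type": "faithfulness", "category": "color", "condition": "noisy",
     "prediction": "no", "answer": "no"},
    {"type": "factuality", "category": "geography", "condition": "clean",
     "prediction": "no", "answer": "yes"},
]

acc = per_category_accuracy(records)
drops = noise_sensitivity(acc)
```

Keeping the condition in the grouping key means the same tally serves both the per-category breakdown and the perturbation-sensitivity comparison; categories with no noisy counterpart are simply left out of `drops`.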
Authors

Bei Yan
Northeastern University
Signal Processing

Zhiyuan Chen
Key Laboratory of AI Safety of CAS, Institute of Computing Technology, Chinese Academy of Sciences (CAS), University of Chinese Academy of Sciences, Beijing, China

Yuecong Min
Institute of Computing Technology, Chinese Academy of Sciences
Sign Language Processing, Gesture Recognition

Jie Zhang
Key Laboratory of AI Safety of CAS, Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing, China

Jiahao Wang
Trustworthy Technology and Engineering Laboratory, Huawei, Shenzhen, China

Xiaozhen Wang
Trustworthy Technology and Engineering Laboratory, Huawei, Shenzhen, China

Shiguang Shan
Professor of Institute of Computing Technology, Chinese Academy of Sciences
Computer Vision, Pattern Recognition, Machine Learning, Face Recognition