🤖 AI Summary
Multimodal large language models (MLLMs) frequently generate fine-grained hallucinations (textual outputs inconsistent with the image content), which undermine their reliability and trustworthiness.
Method: This work introduces a unified task comprising hallucination text-span localization, six-category error classification, and targeted editing. To support it, we construct VisionHall, the first large-scale dataset for this task (27K samples), combining human annotation with image-conditioned synthetic data generation. The proposed method, ZINA, uses an error-type-dependent modeling framework that integrates detection and editing; it is built through multi-stage fine-tuning of sequence-labeling and classification models, graph-structured synthesis of hallucination types, and crowd-sourced verification of annotation consistency.
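The detection step (span localization plus error-type classification) can be viewed as token-level sequence labeling. A minimal sketch, assuming a BIO tagging scheme with the error category attached to each tag; the label names, categories, and helper below are illustrative assumptions, not the paper's actual formulation:

```python
# Hypothetical BIO encoding of hallucinated spans for sequence labeling.
# The error categories listed here are placeholders, not the paper's six types.
ERROR_TYPES = ["object", "attribute", "relation", "count", "ocr", "other"]

def spans_to_bio(tokens, spans):
    """Convert (start, end, error_type) token spans (end exclusive) to BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # first token of the hallucinated span
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # continuation tokens of the span
    return tags

tokens = "A red car is parked by the lake".split()
# Suppose the image actually shows a blue car: "red" is an attribute error.
print(spans_to_bio(tokens, [(1, 2, "attribute")]))
# → ['O', 'B-attribute', 'O', 'O', 'O', 'O', 'O', 'O']
```

A labeling model trained on such tags localizes hallucinated spans and classifies their error type in a single pass, which is one natural way to realize the detection half of the framework.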
Contribution/Results: On VisionHall, our method significantly outperforms GPT-4o and Llama-3.2, achieving +12.6% F1 in hallucination detection and +9.3% BLEU in hallucination editing. These results establish a basis for fine-grained, trustworthy evaluation and targeted correction of MLLM hallucinations.
📝 Abstract
Multimodal Large Language Models (MLLMs) often generate hallucinations, where the output deviates from the visual content. Given that these hallucinations can take diverse forms, detecting them at a fine-grained level is essential for comprehensive evaluation and analysis. To this end, we propose a novel task of multimodal fine-grained hallucination detection and editing for MLLMs. We further propose ZINA, a method that identifies hallucinated spans at a fine-grained level, classifies their error types into six categories, and suggests appropriate refinements. To train and evaluate models for this task, we construct VisionHall, a dataset comprising 6.9k outputs from twelve MLLMs, manually annotated by 211 annotators, and 20k synthetic samples generated using a graph-based method that captures dependencies among error types. We demonstrate that ZINA outperforms existing methods, including GPT-4o and Llama-3.2, in both detection and editing tasks.
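The graph-based synthesis idea, that error types depend on one another, can be sketched as sampling from a small dependency graph in which injecting one error type raises the probability of injecting its dependents. The graph, probabilities, and type names below are invented for illustration and do not reflect the paper's actual dependencies:

```python
import random

# Hypothetical dependency graph over error types: edges mean "injecting the
# source type makes the target type more likely". Purely illustrative.
DEPENDENCIES = {
    "object": ["attribute", "relation"],  # a wrong object often drags in wrong attributes/relations
    "attribute": [],
    "relation": ["count"],
    "count": [],
}

def sample_error_types(base_p=0.3, dep_p=0.6, rng=random):
    """Sample a set of error types to inject, boosting dependents of chosen types."""
    chosen = set()
    for etype in DEPENDENCIES:  # topological-ish order: parents listed before children
        boosted = any(etype in DEPENDENCIES[parent] for parent in chosen)
        if rng.random() < (dep_p if boosted else base_p):
            chosen.add(etype)
    return sorted(chosen)

rng = random.Random(0)
print(sample_error_types(rng=rng))  # a correlated subset of error types
```

Each sampled set then drives image-conditioned corruption of a clean caption, so the synthetic data reflects co-occurrence structure among error types rather than independent noise.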