🤖 AI Summary
Multimodal large language models (MLLMs) frequently generate fine-grained hallucinations (textual outputs inconsistent with the image content), which undermine their reliability and trustworthiness.
Method: This work introduces a unified task comprising hallucination text-span localization, six-category error classification, and targeted editing. To support it, we construct VisionHall, the first large-scale dataset for this task (27K samples), combining human annotation with image-conditioned synthetic data generation. The proposed method, ZINA, uses an error-type-dependent modeling framework that integrates detection and editing; it is built through multi-stage fine-tuning of sequence-labeling and classification models, graph-structured synthesis of hallucination types, and crowd-sourced verification of annotation consistency.
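The detection step (span localization plus error-type classification) can be viewed as token-level sequence labeling. A minimal sketch, assuming a BIO tagging scheme with the error category attached to each tag; the label names, categories, and helper below are illustrative assumptions, not the paper's actual formulation:

```python
# Hypothetical BIO encoding of hallucinated spans for sequence labeling.
# The error categories listed here are placeholders, not the paper's six types.
ERROR_TYPES = ["object", "attribute", "relation", "count", "ocr", "other"]

def spans_to_bio(tokens, spans):
    """Convert (start, end, error_type) token spans (end exclusive) to BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # first token of the hallucinated span
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # continuation tokens of the span
    return tags

tokens = "A red car is parked by the lake".split()
# Suppose the image actually shows a blue car: "red" is an attribute error.
print(spans_to_bio(tokens, [(1, 2, "attribute")]))
# → ['O', 'B-attribute', 'O', 'O', 'O', 'O', 'O', 'O']
```

A labeling model trained on such tags localizes hallucinated spans and classifies their error type in a single pass, which is one natural way to realize the detection half of the framework.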
Contribution/Results: On VisionHall, our method significantly outperforms GPT-4o and Llama-3.2, achieving +12.6% F1 in hallucination detection and +9.3% BLEU in hallucination editing. These results establish a basis for fine-grained, trustworthy evaluation and targeted correction of MLLM hallucinations.
📝 Abstract
Multimodal Large Language Models (MLLMs) often generate hallucinations, where the output deviates from the visual content. Given that these hallucinations can take diverse forms, detecting them at a fine-grained level is essential for comprehensive evaluation and analysis. To this end, we propose a novel task of multimodal fine-grained hallucination detection and editing for MLLMs. We further propose ZINA, a method that identifies hallucinated spans at a fine-grained level, classifies their error types into six categories, and suggests appropriate refinements. To train and evaluate models for this task, we construct VisionHall, a dataset comprising 6.9k outputs from twelve MLLMs, manually annotated by 211 annotators, and 20k synthetic samples generated using a graph-based method that captures dependencies among error types. We demonstrate that ZINA outperforms existing methods, including GPT-4o and Llama-3.2, in both detection and editing tasks.
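The graph-based synthesis idea, that error types depend on one another, can be sketched as sampling from a small dependency graph in which injecting one error type raises the probability of injecting its dependents. The graph, probabilities, and type names below are invented for illustration and do not reflect the paper's actual dependencies:

```python
import random

# Hypothetical dependency graph over error types: edges mean "injecting the
# source type makes the target type more likely". Purely illustrative.
DEPENDENCIES = {
    "object": ["attribute", "relation"],  # a wrong object often drags in wrong attributes/relations
    "attribute": [],
    "relation": ["count"],
    "count": [],
}

def sample_error_types(base_p=0.3, dep_p=0.6, rng=random):
    """Sample a set of error types to inject, boosting dependents of chosen types."""
    chosen = set()
    for etype in DEPENDENCIES:  # topological-ish order: parents listed before children
        boosted = any(etype in DEPENDENCIES[parent] for parent in chosen)
        if rng.random() < (dep_p if boosted else base_p):
            chosen.add(etype)
    return sorted(chosen)

rng = random.Random(0)
print(sample_error_types(rng=rng))  # a correlated subset of error types
```

Each sampled set then drives image-conditioned corruption of a clean caption, so the synthetic data reflects co-occurrence structure among error types rather than independent noise.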