TIGER: Traceable Inference with Graph-Based Evidence Routing for Mitigating Hallucinations in Multimodal Generation

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses factually inconsistent hallucinations in multimodal generative models. The authors propose TIGER, an inference-time framework that builds an observation graph from the input and a claim graph from the output, then computes a risk score for each generated claim from how well the two graphs align. Only high-risk claims undergo localized correction, and the backbone model remains frozen. TIGER is presented as the first method to enable fact-level, traceable hallucination rectification: because the two graphs are extracted independently, erroneous output content cannot bias the interpretation of the input, and feedback can be ranked and scheduled claim by claim. Experiments on four cross-modal tasks spanning images, text, audio, and video show that TIGER reduces unsupported content without degrading original task performance, and that it works with multiple backbone architectures and in multi-source settings.
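
The scoring-and-selection step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. Everything in it is an assumption made for illustration: graphs are reduced to sets of (subject, relation, object) triples, and the `Triple` type, the `risk_score` and `select_for_repair` helpers, the three risk levels (0.0 supported, 0.5 unsupported, 1.0 conflicting), and the 0.5 threshold are hypothetical choices, not values from the paper.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Triple(NamedTuple):
    """Hypothetical edge representation: one fact as (subject, relation, object)."""
    subject: str
    relation: str
    obj: str


def risk_score(claim: Triple, observations: Set[Triple]) -> float:
    """Toy risk score (assumed scale, not from the paper).

    0.0 -> the claim appears verbatim in the observation graph (supported)
    1.0 -> an observation has the same subject and relation but a
           different object (conflict)
    0.5 -> nothing in the observation graph supports or contradicts it
    """
    if claim in observations:
        return 0.0
    for obs in observations:
        if obs.subject == claim.subject and obs.relation == claim.relation:
            return 1.0
    return 0.5


def select_for_repair(
    claims: Iterable[Triple],
    observations: Set[Triple],
    threshold: float = 0.5,  # assumed cut-off, chosen only for this example
) -> List[Tuple[float, Triple]]:
    """Return claims whose risk meets the threshold, highest risk first."""
    scored = [(risk_score(c, observations), c) for c in claims]
    risky = [(r, c) for r, c in scored if r >= threshold]
    risky.sort(key=lambda rc: rc[0], reverse=True)
    return risky


# Observation graph extracted from the input (e.g. an image).
observations = {
    Triple("dog", "color", "brown"),
    Triple("dog", "on", "sofa"),
}

# Claim graph extracted separately from the generated caption.
claims = [
    Triple("dog", "color", "brown"),  # supported
    Triple("dog", "color", "black"),  # conflicts with the observation
    Triple("cat", "on", "floor"),     # not supported by anything observed
]

repair_queue = select_for_repair(claims, observations)
for risk, claim in repair_queue:
    print(f"{risk:.1f}  {claim.subject} {claim.relation} {claim.obj}")
```

Because each claim carries its own score, the repair queue can be ordered or truncated per claim, which is the property the summary refers to as claim-wise ranking and scheduling. A real system would replace exact triple matching with a softer alignment between graph nodes.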

📝 Abstract

We study fact-level repair for multimodal generation, where a fluent output may contain specific facts that are not supported by the input. Existing inference-time repair methods often generate feedback by jointly conditioning on the input and the current output. This design has two limitations: hallucinated claims in the output can bias the model's interpretation of the input, and free-form feedback cannot be ranked or scheduled at the fact level. We present TIGER, an inference-time framework that redesigns feedback for localized repair. TIGER independently extracts an observation graph from the input and a claim graph from the current output, then assigns each claim a graph-conditioned risk score based on support and conflict. The model repairs selected high-risk claims while keeping the backbone frozen. We provide a convergence analysis showing that the expected total risk decreases geometrically to an explicit asymptotic bound under mild assumptions. Experiments across four cross-modal paths, including image-to-text, image+text-to-text, audio-to-text, and video-to-text, show that TIGER reduces unsupported content while preserving task quality. The gains hold across multiple backbones, and a CrisisFACTS case study suggests that the same repair mechanism can improve grounding in multi-source settings.
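
The abstract does not state the form of the convergence result. As a reading aid only, a geometric decrease to an asymptotic bound typically follows from a contraction of the kind below. The symbols are assumptions introduced here, not the paper's notation: $R_t$ is the total risk after repair round $t$, $\rho \in [0, 1)$ is a per-round contraction factor, and $\varepsilon \ge 0$ bounds the risk that a repair round can newly introduce.

```latex
% Assumed one-step contraction (illustrative, not the paper's theorem):
\mathbb{E}[R_{t+1}] \;\le\; \rho\,\mathbb{E}[R_t] + \varepsilon,
\qquad 0 \le \rho < 1 .

% Unrolling the recursion over t rounds, starting from a fixed R_0:
\mathbb{E}[R_t]
  \;\le\; \rho^{t} R_0 + \varepsilon \sum_{k=0}^{t-1} \rho^{k}
  \;=\; \rho^{t} R_0 + \varepsilon\,\frac{1-\rho^{t}}{1-\rho}
  \;\le\; \rho^{t} R_0 + \frac{\varepsilon}{1-\rho} .
```

Under these assumed conditions the first term vanishes geometrically in $t$, and $\varepsilon / (1-\rho)$ plays the role of the explicit asymptotic bound. The paper's actual assumptions and constants may differ.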

Problem

Research questions and friction points this paper is trying to address.

multimodal generation

hallucination

fact-level repair

unsupported content

inference-time repair

Innovation

Methods, ideas, or system contributions that make the work stand out.

graph-based evidence routing

fact-level repair

multimodal hallucination mitigation