🤖 AI Summary
This work addresses “near-plausible” hallucinations in long-video multimodal (Omni) assistants, which arise when a model binds real evidence to the wrong speaker, moment, or modality, and which conventional item-level evaluation struggles to detect. The authors propose a counterfactual event-binding protocol that pairs supported and counterfactual claims built from the same audio-visual evidence and scores them by strict-pair accuracy. They further introduce Modality-Perturbation Reliability Calibration, a frozen-backbone framework that estimates how well each claim is supported. On the newly curated OmniHalluc-L benchmark, the method raises the strict-pair accuracy of Qwen2.5-Omni-7B from 32.06% to 36.22% and of Qwen3-Omni-Instruct from 41.55% to 51.09%. With Qwen3, it also improves target-adapted multiple-choice accuracy on OmniVideoBench (+2.20) and WorldSense (+1.51).
📝 Abstract
Long-video Omni assistants often fail not by inventing content, but by misbinding real evidence: they hear the right utterance and see the right event, yet attach it to the wrong speaker, moment, or modality. These \emph{almost-true} errors evade standard video QA because local evidence remains valid, so item-level scoring can reward both a supported claim and its near-counterfactual. We introduce a counterfactual event-binding protocol that constructs paired supported/counterfactual claims from the same audio-visual event evidence and evaluates them by strict-pair accuracy. We instantiate it as OmniHalluc-L, a benchmark for long-video Omni hallucination, with 3{,}600 single-claim QA items from 638 long-form videos averaging 24.16 minutes and covering 256.87 hours. Under this protocol, open-weight Omni models remain weak at pair-level binding: Qwen2.5-Omni-7B reaches 32.06\% and Qwen3-Omni-Instruct reaches 41.55\%, versus 76.54\% for a closed-source reference. To narrow this gap without updating the backbone, we propose Modality-Perturbation Reliability Calibration, a frozen-backbone framework that selects audio-negative probes within video-level folds and fuses their response shifts with native audio-visual confidence into per-claim support estimates. It lifts Qwen2.5-Omni-7B to 36.22\% and Qwen3-Omni-Instruct to 51.09\% on OmniHalluc-L, and improves target-adapted MCQ accuracy on OmniVideoBench ($+$2.20) and WorldSense ($+$1.51) with Qwen3.
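To make the difference between item-level scoring and strict-pair accuracy concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes each QA item carries a hypothetical `pair_id` linking a supported claim to its counterfactual, a boolean `label` (is the claim supported?), and a boolean model prediction `pred`; these field names are illustrative only. A pair counts as correct only if every item in it is judged correctly.

```python
from collections import defaultdict


def item_accuracy(records):
    """Fraction of individual claims judged correctly."""
    return sum(r["pred"] == r["label"] for r in records) / len(records)


def strict_pair_accuracy(records):
    """Fraction of pairs in which every claim is judged correctly.

    Assumption: records sharing a `pair_id` form one supported/counterfactual
    pair built from the same event evidence.
    """
    pairs = defaultdict(list)
    for r in records:
        pairs[r["pair_id"]].append(r["pred"] == r["label"])
    return sum(all(hits) for hits in pairs.values()) / len(pairs)


# Toy data (hypothetical): label=True means the claim is supported.
records = [
    {"pair_id": "e1", "label": True, "pred": True},
    {"pair_id": "e1", "label": False, "pred": True},   # accepts the counterfactual
    {"pair_id": "e2", "label": True, "pred": True},
    {"pair_id": "e2", "label": False, "pred": False},
]

# A model that answers "supported" for everything.
always_yes = [{**r, "pred": True} for r in records]

print(item_accuracy(records), strict_pair_accuracy(records))
print(item_accuracy(always_yes), strict_pair_accuracy(always_yes))
```

On this toy data the always-"supported" model still gets half of the items right but no pair right, which illustrates why item-level scoring can reward a claim and its near-counterfactual together while strict-pair scoring does not.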