NOAH: Benchmarking Narrative Prior driven Hallucination and Omission in Video Large Language Models

πŸ“… 2025-11-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Video large language models (Video LLMs) often sacrifice visual factual accuracy for narrative coherence by relying on narrative priors, leading to two fundamental error types: hallucination (fabricating non-existent events) and omission (overlooking actual events). To address this, we propose the first narrative-prior-driven error analysis framework and introduce NOAH, a large-scale benchmark. NOAH constructs composite video samples by inserting exogenous video clips and establishes a multi-task evaluation suite covering descriptive generation and existence-, temporal-, and narrative-aware question answering, with customized metrics to quantify distinct error patterns. Experiments on more than 60K samples across diverse models systematically reveal strong dependence on narrative priors and show that this reliance intensifies when fewer frames are sampled. NOAH provides a controllable, scalable benchmark and delivers critical insights for building trustworthy vision-language models.

πŸ“ Abstract
Video large language models (Video LLMs) have recently achieved strong performance on tasks such as captioning, summarization, and question answering. Many models and training methods explicitly encourage continuity across events to enhance narrative coherence. While this improves fluency, it also introduces an inductive bias that prioritizes storyline consistency over strict grounding in visual evidence. We identify this bias, which we call narrative prior, as a key driver of two errors: hallucinations, where non-existent events are introduced or existing ones are misinterpreted, and omissions, where factual events are suppressed because they are misaligned with surrounding context. To systematically evaluate narrative prior-induced errors, we introduce NOAH, a large-scale benchmark that constructs composite videos by inserting clips from other sources into target videos. By varying semantic similarity and insertion position, our benchmark enables controlled and scalable analysis of narrative priors. We design one captioning task with tailored metrics and three QA tasks - Existence, Temporal, and Narrative - yielding more than 60K evaluation samples. Extensive experiments yield three key findings: (i) most Video LLMs exhibit hallucinations and omissions driven by narrative priors, (ii) the patterns of these errors vary across architectures and depend on event similarity and insertion position, and (iii) reliance on narrative priors intensifies under sampling with fewer frames, amplifying errors when event continuity is weak. We establish NOAH as the first standardized evaluation of narrative prior-induced hallucination and omission in Video LLMs, providing a foundation for developing more reliable and trustworthy models. Our benchmark and code are available at https://anonymous550520.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Evaluating narrative prior-induced hallucinations in Video LLMs
Measuring omissions caused by storyline bias in video models
Benchmarking visual grounding errors in video language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructing composite videos by inserting clips
Varying semantic similarity and insertion position
Designing captioning and QA tasks with tailored metrics
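
The composite-video construction described above can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the authors' released code: the function name, the frame-list representation, and the `position` parameterization are all assumptions made for illustration.

```python
# Illustrative sketch of NOAH-style composite sample construction:
# an exogenous clip is inserted into a target video at a controlled
# relative position, and the inserted span is recorded as ground truth
# for scoring hallucination and omission. Frame lists stand in for
# decoded video frames; all names here are hypothetical.

def build_composite(target_frames, inserted_frames, position):
    """Insert `inserted_frames` into `target_frames` at a relative
    position in [0, 1]; return the composite and the inserted span."""
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be in [0, 1]")
    idx = round(position * len(target_frames))
    composite = target_frames[:idx] + list(inserted_frames) + target_frames[idx:]
    # Ground-truth index range of the inserted event (start inclusive,
    # end exclusive), used later to check whether a model reports,
    # omits, or fabricates events around the insertion point.
    span = (idx, idx + len(inserted_frames))
    return composite, span
```

Varying `position` (beginning, middle, end) and choosing `inserted_frames` by semantic similarity to the target video would give the controlled axes of analysis the benchmark describes.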
πŸ”Ž Similar Papers
No similar papers found.
Kyuho Lee
Korea University
Euntae Kim
Korea University
Jinwoo Choi
Kyung Hee University
Buru Chang
Korea University
Natural Language Processing, Multimodal Machine Learning, Data Mining