Video Evidence to Reasoning: Efficient Video Understanding via Explicit Evidence Grounding

📅 2026-01-12
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high computational cost and hallucination issues commonly encountered by large vision-language models in video reasoning. To this end, the authors propose the Chain-of-Evidence (CoE) framework, which decouples perception-based localization from reasoning to jointly optimize efficiency and reliability. The key innovations include a lightweight Evidence Grounding Module (EGM), a reinforcement learning-based evidence-anchoring mechanism, and CoE-Instruct, a dual-annotated instruction-tuning dataset comprising 164k samples. Experimental results demonstrate that the proposed method achieves state-of-the-art performance across five benchmarks, including Video-MME and MVBench, significantly improving reasoning accuracy while effectively mitigating hallucinations.

๐Ÿ“ Abstract
Large Vision-Language Models (LVLMs) face a fundamental dilemma in video reasoning: they are caught between the prohibitive computational costs of verbose reasoning and the hallucination risks of efficient, ungrounded approaches. To resolve this, we introduce the Chain of Evidence (CoE), a novel framework that architecturally decouples and co-optimizes perceptual grounding and reasoning efficiency. CoE incorporates two core innovations: (1) A lightweight Evidence Grounding Module (EGM) that acts as a query-guided filter, dynamically identifying and extracting a compact set of high-fidelity visual evidence; and (2) An Evidence-Anchoring Protocol optimized via Reinforcement Learning. Crucially, we design a composite reward mechanism that enforces process alignment, compelling the model to strictly reference identified temporal anchors during deduction, thereby mitigating hallucinations. To enable this, we construct CoE-Instruct, a large-scale dataset (164k samples) featuring a novel dual-annotation schema for separate perception and reasoning supervision. Extensive experiments on five benchmarks, including Video-MME, MVBench, and VSI-Bench, demonstrate that CoE-enhanced models establish a new state-of-the-art. They significantly outperform existing methods in accuracy, proving CoE to be a powerful and practical paradigm for reliable video understanding.
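The page carries no code, but the abstract's description of the EGM as a "query-guided filter" that extracts "a compact set of high-fidelity visual evidence" can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function name, the cosine-similarity scoring, and the top-k selection are not taken from the paper.

```python
import numpy as np

def ground_evidence(frame_feats, query_feat, k=4):
    """Hypothetical query-guided frame filter in the spirit of the EGM.

    frame_feats: (T, D) array of per-frame embeddings
    query_feat:  (D,) embedding of the question/query
    Returns the indices of the k best-matching frames in temporal order,
    i.e. a compact evidence set instead of the full frame sequence.
    """
    # Cosine similarity between each frame and the query.
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    scores = f @ q
    # Keep the k highest-scoring frames, restored to temporal order
    # so downstream reasoning sees evidence in chronological sequence.
    top = np.argsort(scores)[-k:]
    return np.sort(top)
```

The actual EGM is presumably a learned module rather than a fixed similarity filter; the sketch only shows the shape of the operation: many frames in, a small query-relevant subset out.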
Problem

Research questions and friction points this paper is trying to address.

video reasoning
hallucination
evidence grounding
vision-language models
reasoning efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain of Evidence
Evidence Grounding Module
Evidence-Anchoring Protocol
Reinforcement Learning
Video Understanding
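The abstract mentions a "composite reward mechanism" that compels the model to reference identified temporal anchors during deduction. A rough, hypothetical sketch of such a reward is below: the weights, the overlap formulation, and the interface are all assumptions, not the paper's actual reward design.

```python
def composite_reward(answer_correct, cited_spans, gt_spans,
                     w_ans=1.0, w_anchor=0.5):
    """Hypothetical composite reward: answer correctness plus anchoring.

    cited_spans / gt_spans: lists of (start, end) time intervals in seconds,
    the spans the model cites versus the annotated evidence spans.
    The anchoring term is the fraction of cited time that falls inside
    ground-truth evidence, so the policy earns reward only for citing
    spans that overlap annotated evidence.
    """
    cited_total = sum(e - s for s, e in cited_spans)
    if cited_total == 0:
        anchor = 0.0
    else:
        overlap = 0.0
        for cs, ce in cited_spans:
            for gs, ge in gt_spans:
                overlap += max(0.0, min(ce, ge) - max(cs, gs))
        anchor = overlap / cited_total
    return w_ans * float(answer_correct) + w_anchor * anchor
```

Under this shape, a correct answer grounded in the wrong spans scores lower than one grounded in the annotated evidence, which is the behavior the paper's "process alignment" objective is described as enforcing.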
🔎 Similar Papers
Yanxiang Huang
Department of Applied Mathematics, The Hong Kong Polytechnic University
Guohua Gao
Department of Applied Mathematics, The Hong Kong Polytechnic University
Zhaoyang Wei
University of Chinese Academy of Sciences
Computer Vision
Point Prompt
Pointly Supervision
Weakly Supervision
Interactive Perception
Jianyuan Ni
Department of Applied Mathematics, The Hong Kong Polytechnic University