Video Evidence to Reasoning: Efficient Video Understanding via Explicit Evidence Grounding

📅 2026-01-12
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high computational cost and hallucination issues commonly encountered by large vision-language models in video reasoning. To this end, the authors propose the Chain-of-Evidence (CoE) framework, which decouples perception-based localization from reasoning to jointly optimize efficiency and reliability. The key innovations include a lightweight Evidence Grounding Module (EGM), a reinforcement learning-based evidence-anchoring mechanism, and CoE-Instruct, a dual-annotated instruction-tuning dataset comprising 164k samples. Experimental results demonstrate that the proposed method achieves state-of-the-art performance across five benchmarks, including Video-MME and MVBench, significantly improving reasoning accuracy while effectively mitigating hallucinations.

๐Ÿ“ Abstract
Large Vision-Language Models (LVLMs) face a fundamental dilemma in video reasoning: they are caught between the prohibitive computational costs of verbose reasoning and the hallucination risks of efficient, ungrounded approaches. To resolve this, we introduce the Chain of Evidence (CoE), a novel framework that architecturally decouples and co-optimizes perceptual grounding and reasoning efficiency. CoE incorporates two core innovations: (1) A lightweight Evidence Grounding Module (EGM) that acts as a query-guided filter, dynamically identifying and extracting a compact set of high-fidelity visual evidence; and (2) An Evidence-Anchoring Protocol optimized via Reinforcement Learning. Crucially, we design a composite reward mechanism that enforces process alignment, compelling the model to strictly reference identified temporal anchors during deduction, thereby mitigating hallucinations. To enable this, we construct CoE-Instruct, a large-scale dataset (164k samples) featuring a novel dual-annotation schema for separate perception and reasoning supervision. Extensive experiments on five benchmarks, including Video-MME, MVBench, and VSI-Bench, demonstrate that CoE-enhanced models establish a new state-of-the-art. They significantly outperform existing methods in accuracy, proving CoE to be a powerful and practical paradigm for reliable video understanding.
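The page carries no code, but the abstract's description of the EGM as a "query-guided filter" that extracts "a compact set of high-fidelity visual evidence" can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function name, the cosine-similarity scoring, and the top-k selection are not taken from the paper.

```python
import numpy as np

def ground_evidence(frame_feats, query_feat, k=4):
    """Hypothetical query-guided frame filter in the spirit of the EGM.

    frame_feats: (T, D) array of per-frame embeddings
    query_feat:  (D,) embedding of the question/query
    Returns the indices of the k best-matching frames in temporal order,
    i.e. a compact evidence set instead of the full frame sequence.
    """
    # Cosine similarity between each frame and the query.
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    scores = f @ q
    # Keep the k highest-scoring frames, restored to temporal order
    # so downstream reasoning sees evidence in chronological sequence.
    top = np.argsort(scores)[-k:]
    return np.sort(top)
```

The actual EGM is presumably a learned module rather than a fixed similarity filter; the sketch only shows the shape of the operation: many frames in, a small query-relevant subset out.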
Problem

Research questions and friction points this paper is trying to address.

video reasoning
hallucination
evidence grounding
vision-language models
reasoning efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain of Evidence
Evidence Grounding Module
Evidence-Anchoring Protocol
Reinforcement Learning
Video Understanding
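The abstract mentions a "composite reward mechanism" that compels the model to reference identified temporal anchors during deduction. A rough, hypothetical sketch of such a reward is below: the weights, the overlap formulation, and the interface are all assumptions, not the paper's actual reward design.

```python
def composite_reward(answer_correct, cited_spans, gt_spans,
                     w_ans=1.0, w_anchor=0.5):
    """Hypothetical composite reward: answer correctness plus anchoring.

    cited_spans / gt_spans: lists of (start, end) time intervals in seconds,
    the spans the model cites versus the annotated evidence spans.
    The anchoring term is the fraction of cited time that falls inside
    ground-truth evidence, so the policy earns reward only for citing
    spans that overlap annotated evidence.
    """
    cited_total = sum(e - s for s, e in cited_spans)
    if cited_total == 0:
        anchor = 0.0
    else:
        overlap = 0.0
        for cs, ce in cited_spans:
            for gs, ge in gt_spans:
                overlap += max(0.0, min(ce, ge) - max(cs, gs))
        anchor = overlap / cited_total
    return w_ans * float(answer_correct) + w_anchor * anchor
```

Under this shape, a correct answer grounded in the wrong spans scores lower than one grounded in the annotated evidence, which is the behavior the paper's "process alignment" objective is described as enforcing.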
🔎 Similar Papers
Yanxiang Huang
Department of Applied Mathematics, The Hong Kong Polytechnic University
Guohua Gao
Department of Applied Mathematics, The Hong Kong Polytechnic University
Zhaoyang Wei
University of Chinese Academy of Sciences
Computer Vision
Point Prompt
Pointly Supervision
Weakly Supervision
Interactive Perception
Jianyuan Ni
Department of Applied Mathematics, The Hong Kong Polytechnic University