See More, Think Deeper: Query-Expanded Visual Evidence and Answer-Clue Guided Reflection for Long Video Understanding

📅 2026-06-08
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Current video large language models (Video-LLMs) are limited in long-form video understanding due to their reliance on a single evidence acquisition strategy and the absence of a visually grounded answer verification mechanism. This work proposes CoVER, a novel framework that introduces, for the first time, an evidence-centric paradigm coupled with a visually verifiable reasoning process. CoVER dynamically retrieves multi-source visual evidence through query expansion to “see more,” while simultaneously employing answer-guided visual feedback to iteratively reflect and validate its reasoning to “think deeper.” The resulting end-to-end Video-LLM achieves significant performance gains over existing open-source models at comparable parameter scales and even surpasses leading closed-source systems on several key metrics.
📝 Abstract
Recent advances in Video Large Language Models (Video-LLMs) have enabled strong performance on long-video understanding tasks. However, existing methods still face two key limitations: evidence acquisition often relies on a single search intent, and answer generation lacks an effective visual feedback mechanism. To address these limitations, we propose CoVER, a Comprehensive Visual Evidence and Reflection framework for long-video understanding. CoVER enables Video-LLMs to See More by dynamically gathering query-expanded visual evidence, and to Think Deeper by verifying draft answers with answer-specific visual feedback. Together, these mechanisms shift long-video understanding from answer-centric generation to evidence-centric, visually verifiable reasoning. Experimental results show that CoVER-7B substantially outperforms models at the same parameter scale and even surpasses state-of-the-art closed-source models on certain metrics.
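The two ideas in the abstract can be illustrated with a toy sketch. This is not CoVER's implementation, which the abstract does not detail: here each video frame is stood in for by a text caption, relevance is plain word overlap, and the names `retrieve`, `see_more`, `think_deeper`, and the hand-written expansion query are all hypothetical. The sketch only shows why retrieving with several expanded queries can surface evidence that a single search intent misses, and how a draft answer can itself be used as a retrieval clue to check for supporting evidence.

```python
from typing import List, Set, Tuple

# Assumption: captions stand in for frames, and word overlap stands in for
# visual relevance. A real system would score visual features instead.
STOPWORDS = {"a", "the", "on", "into", "what", "does"}


def tokens(text: str) -> Set[str]:
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def retrieve(query: str, captions: List[str], k: int = 2) -> List[int]:
    """Indices of up to k captions sharing at least one word with the query."""
    q = tokens(query)
    overlap = [len(q & tokens(c)) for c in captions]
    ranked = sorted(range(len(captions)), key=lambda i: (-overlap[i], i))
    return [i for i in ranked[:k] if overlap[i] > 0]


def see_more(question: str, expansions: List[str], captions: List[str]) -> List[int]:
    """Union of the evidence retrieved for the question and each expanded query."""
    found: Set[int] = set()
    for query in [question, *expansions]:
        found.update(retrieve(query, captions))
    return sorted(found)


def think_deeper(draft: str, captions: List[str]) -> Tuple[bool, List[int]]:
    """Use the draft answer as a retrieval clue; accept it only if a frame supports it."""
    support = retrieve(draft, captions)
    return bool(support), support


captions = [
    "a man opens the fridge",
    "a dog sleeps on the sofa",
    "someone pours milk into a glass",
    "a woman reads a book",
]
question = "what does the man drink"
expansions = ["pours glass"]  # hypothetical output of a query-expansion step

print(see_more(question, [], captions))          # [0]
print(see_more(question, expansions, captions))  # [0, 2]
print(think_deeper("milk", captions))            # (True, [2])
print(think_deeper("juice", captions))           # (False, [])
```

With the question alone, only the fridge frame is retrieved; the expanded query adds the frame that actually shows the milk. The draft answer "milk" finds a supporting frame, while "juice" finds none and would be sent back for revision in an iterative loop.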
Problem

Research questions and friction points this paper is trying to address.

long-video understanding
visual evidence
answer generation
visual feedback
Video-LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video-LLMs
query-expanded visual evidence
answer-clue guided reflection
evidence-centric reasoning
long video understanding