See More, Think Deeper: Query-Expanded Visual Evidence and Answer-Clue Guided Reflection for Long Video Understanding

📅 2026-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current video large language models (Video-LLMs) are limited in long-form video understanding because they acquire evidence with a single search intent and lack a visually grounded mechanism for verifying answers. This work proposes CoVER, a framework that pairs an evidence-centric paradigm with a visually verifiable reasoning process. CoVER dynamically retrieves visual evidence through query expansion to “see more,” then uses answer-specific visual feedback to reflect on and validate its draft answers to “think deeper.” The resulting CoVER-7B substantially outperforms models at the same parameter scale and surpasses state-of-the-art closed-source models on certain metrics.

📝 Abstract

Recent advances in Video Large Language Models (Video-LLMs) have enabled strong performance on long-video understanding tasks. However, existing methods still face two key limitations: evidence acquisition often relies on a single search intent, and answer generation lacks an effective visual feedback mechanism. To address these limitations, we propose CoVER, a Comprehensive Visual Evidence and Reflection framework for long-video understanding. CoVER enables Video-LLMs to See More by dynamically gathering query-expanded visual evidence, and to Think Deeper by verifying draft answers with answer-specific visual feedback. Together, these mechanisms shift long-video understanding from answer-centric generation to evidence-centric, visually verifiable reasoning. Experimental results show that CoVER-7B substantially outperforms models at the same parameter scale and even surpasses state-of-the-art closed-source models on certain metrics.
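The two mechanisms named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: frames are represented by text captions, retrieval is plain keyword overlap rather than any visual model, and the names `Frame`, `retrieve`, `gather_evidence`, `answer_with_reflection`, `expand`, and `draft` are all hypothetical stand-ins chosen for illustration. The sketch only shows the control flow: gather evidence for several expanded queries, draft an answer, then use the draft as a clue to fetch answer-specific evidence and revise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    timestamp: float  # position in the video, in seconds
    caption: str      # assumption: a text stand-in for the frame's visual content


def score(query: str, frame: Frame) -> int:
    """Toy relevance: count of query words that also appear in the caption."""
    return len(set(query.lower().split()) & set(frame.caption.lower().split()))


def retrieve(query, frames, k=2):
    """Return up to k frames with a non-zero score, best first."""
    ranked = sorted(frames, key=lambda f: score(query, f), reverse=True)
    return [f for f in ranked[:k] if score(query, f) > 0]


def gather_evidence(queries, frames, k=2):
    """'See more': union of the frames retrieved for every query."""
    seen = {}
    for q in queries:
        for f in retrieve(q, frames, k):
            seen[f.timestamp] = f
    return [seen[t] for t in sorted(seen)]


def answer_with_reflection(question, expand, draft, frames, max_rounds=2):
    """'Think deeper': re-check a draft answer against answer-specific evidence.

    `expand` and `draft` are hypothetical callables standing in for the
    Video-LLM's query expansion and answer generation.
    """
    evidence = gather_evidence([question] + expand(question), frames)
    answer = draft(question, evidence)
    for _ in range(max_rounds):
        # Use the draft answer as a clue for a second, targeted retrieval.
        clue_evidence = gather_evidence([f"{question} {answer}"], frames)
        new = [f for f in clue_evidence if f not in evidence]
        if not new:
            break  # no additional visual feedback to consider
        evidence = sorted(evidence + new, key=lambda f: f.timestamp)
        revised = draft(question, evidence)
        if revised == answer:
            break  # the draft is stable under the extra evidence
        answer = revised
    return answer, evidence


frames = [
    Frame(1.0, "a man opens the fridge"),
    Frame(5.0, "a man pours milk into a bowl"),
    Frame(9.0, "a dog sleeps on the sofa"),
]
answer, evidence = answer_with_reflection(
    "what does he drink",
    expand=lambda q: ["fridge"],   # hypothetical expanded query
    draft=lambda q, ev: "milk",    # hypothetical fixed draft answer
    frames=frames,
)
```

In this toy run the original question matches no caption, the expanded query brings in the fridge frame, and the draft answer then pulls in the milk frame as answer-specific evidence. A real system would replace the keyword matcher and both callables with model calls.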

Problem

Research questions and friction points this paper is trying to address.

long-video understanding

visual evidence

answer generation

visual feedback

Video-LLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Video-LLMs

query-expanded visual evidence

answer-clue guided reflection