AI Summary
Current video large language models (Video-LLMs) are limited in long-form video understanding due to their reliance on a single evidence acquisition strategy and the absence of a visually grounded answer verification mechanism. This work proposes CoVER, a novel framework that introduces, for the first time, an evidence-centric paradigm coupled with a visually verifiable reasoning process. CoVER dynamically retrieves multi-source visual evidence through query expansion to "see more," while simultaneously employing answer-guided visual feedback to iteratively reflect and validate its reasoning to "think deeper." The resulting end-to-end Video-LLM achieves significant performance gains over existing open-source models at comparable parameter scales and even surpasses leading closed-source systems on several key metrics.
Abstract
Recent advances in Video Large Language Models (Video-LLMs) have improved performance on long-video understanding tasks. However, existing methods still face two key limitations: evidence acquisition often relies on a single search intent, and answer generation lacks an effective visual feedback mechanism. To address these limitations, we propose CoVER, a Comprehensive Visual Evidence and Reflection framework for long-video understanding. CoVER enables Video-LLMs to See More by dynamically gathering query-expanded visual evidence, and to Think Deeper by verifying draft answers with answer-specific visual feedback. Together, these mechanisms shift long-video understanding from answer-centric generation to evidence-centric, visually verifiable reasoning. Experimental results show that CoVER-7B substantially outperforms models at the same parameter scale and even surpasses state-of-the-art closed-source models on certain metrics.
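
The two mechanisms described above can be pictured as a simple control loop. The following is only a minimal sketch under stated assumptions, not CoVER's actual implementation: the names `gather_evidence` and `answer_with_reflection`, the callables `expand`, `retrieve`, `draft`, and `verify`, the `max_rounds` limit, and the representation of evidence as frame indices are all hypothetical placeholders chosen for illustration.

```python
from typing import Callable, List, Set, Tuple


def gather_evidence(queries: List[str],
                    retrieve: Callable[[str], List[int]]) -> List[int]:
    """'See more': union the frame indices retrieved for every expanded query."""
    seen: Set[int] = set()
    for query in queries:
        seen.update(retrieve(query))
    return sorted(seen)  # keep frames in temporal order


def answer_with_reflection(
    question: str,
    expand: Callable[[str], List[str]],           # hypothetical query expander
    retrieve: Callable[[str], List[int]],         # hypothetical frame retriever
    draft: Callable[[str, List[int]], str],       # hypothetical answer generator
    verify: Callable[[str, str, List[int]], bool],  # hypothetical visual check
    max_rounds: int = 3,                          # assumed iteration budget
) -> Tuple[str, List[int]]:
    """'Think deeper': re-check each draft answer against answer-specific frames."""
    evidence = gather_evidence(expand(question), retrieve)
    answer = draft(question, evidence)
    for _ in range(max_rounds):
        # Retrieve frames conditioned on the draft answer, not just the question.
        feedback = retrieve(f"{question} {answer}")
        if verify(question, answer, feedback):
            break
        # Verification failed: fold the new frames in and draft again.
        evidence = sorted(set(evidence) | set(feedback))
        answer = draft(question, evidence)
    return answer, evidence
```

In this sketch the loop returns the last draft once the round budget is exhausted, even if it was never verified. How the actual system expands queries, scores frames, and decides when to stop is not specified in the abstract.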