🤖 AI Summary
This work addresses the limitations of existing video reasoning methods, which often overlook critical visual cues and struggle to model complex temporal dynamics and causal relationships during chain-of-thought (CoT) reasoning. To overcome these challenges, we propose a vision-text interleaved CoT reasoning framework that dynamically aligns reasoning steps with corresponding video frames. We further introduce a novel OCR-driven mechanism for compressing CoT supervision signals, substantially improving reasoning efficiency on long videos. Additionally, we construct a high-quality multimodal CoT dataset and develop an automated annotation pipeline. Under identical model scales, our approach achieves state-of-the-art performance while significantly accelerating training convergence and reducing inference overhead.
📝 Abstract
Video reasoning aims to understand complex temporal events and causal relationships within videos. Recently, Chain-of-Thought (CoT) reasoning has been introduced to this field to improve reasoning accuracy. However, existing CoT-based video reasoning methods rely primarily on text-only information for logical deduction and overlook critical visual information during inference. Inspired by the human cognitive mechanism of reviewing visual segments during inference, we propose VTI-CoT, a Visual-Textual Interleaved CoT framework that integrates textual reasoning steps with their corresponding visual frames. Because visual-textual interleaved CoT is scarce in existing datasets, we develop an automated annotation pipeline to construct high-quality multimodal CoT data. Furthermore, reasoning over long-form videos entails increasingly long CoT token sequences, which severely hinder training convergence and efficiency. To address this, we use an Optical Character Recognition (OCR)-based technique that compresses CoT supervision signals into a single canvas. Experimental results demonstrate that VTI-CoT achieves state-of-the-art performance among models of the same parameter scale while significantly improving training efficiency.
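As a rough illustration of what "visual-textual interleaved" could mean at the data level, the sketch below pairs each textual reasoning step with the indices of the frames it refers to and flattens the result into one sequence. This is not the VTI-CoT implementation: the `ReasoningStep` class, the `interleave` function, the text-then-frames ordering, and the example steps are all assumptions made for illustration, since the abstract does not specify the data format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class ReasoningStep:
    """One textual reasoning step plus the frames it refers to (hypothetical layout)."""
    text: str
    frame_indices: List[int] = field(default_factory=list)


# An interleaved item is either ("text", str) or ("frame", int).
Item = Tuple[str, Union[str, int]]


def interleave(steps: List[ReasoningStep]) -> List[Item]:
    """Flatten steps so each step's text is followed by its frame references.

    Placing frames after the text is an assumption; the opposite order
    would be equally easy to express.
    """
    sequence: List[Item] = []
    for step in steps:
        sequence.append(("text", step.text))
        for idx in step.frame_indices:
            sequence.append(("frame", idx))
    return sequence


# Made-up example steps, not taken from the paper's dataset.
steps = [
    ReasoningStep("The person picks up a cup.", [12]),
    ReasoningStep("The cup is later empty, so they drank from it.", [40, 41]),
]
sequence = interleave(steps)
```

In a real model the `("frame", idx)` placeholders would be replaced by visual tokens for the referenced frames; the sketch only shows the ordering of text and frame references.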