VTI-CoT: Visual-Textual Interleaved Chain of Thought for Video Reasoning

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations of existing video reasoning methods: they often overlook critical visual cues, and they struggle to model complex temporal dynamics and causal relationships during chain-of-thought (CoT) reasoning. To overcome these challenges, we propose a visual-textual interleaved CoT reasoning framework that aligns reasoning steps with the corresponding video frames. We further introduce an OCR-driven mechanism for compressing CoT supervision signals, which improves reasoning efficiency on long videos. Additionally, we construct a high-quality multimodal CoT dataset and develop an automated annotation pipeline. Among models of the same scale, our approach achieves state-of-the-art performance while accelerating training convergence and reducing inference overhead.

📝 Abstract

Video reasoning aims to understand complex temporal events and causal relationships within videos. Recently, Chain-of-Thought (CoT) has been introduced to this field to enhance reasoning accuracy. However, existing CoT-based video reasoning methods rely primarily on text-only information for logical deduction and overlook critical visual information during inference. Inspired by the human cognitive mechanism of reviewing visual segments during inference, we propose VTI-CoT, a Visual-Textual Interleaved CoT framework that integrates textual reasoning steps with their corresponding visual frames. Because visual-textual interleaved CoT is scarce in existing datasets, we develop an automated annotation pipeline to construct high-quality multimodal CoT data. Furthermore, reasoning over long-form videos produces increasingly long CoT token sequences, which severely hinder training convergence and efficiency. To address this, we use an Optical Character Recognition (OCR)-based technique to compress CoT supervision signals into a single canvas. Experimental results demonstrate that VTI-CoT achieves state-of-the-art performance among models of the same parameter scale while significantly improving training efficiency.
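
To make the idea of interleaving concrete, the following is a minimal sketch of how textual reasoning steps might be paired with the frames they refer to. It is not the authors' implementation: the class names `TextStep` and `FrameRef`, the helpers `interleave` and `flatten`, the `<frame:i>` placeholder format, and the choice to place frames before their step are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextStep:
    """One textual reasoning step in the chain of thought."""
    text: str


@dataclass
class FrameRef:
    """Pointer to a video frame revisited during reasoning (hypothetical type)."""
    frame_index: int


Segment = Union[TextStep, FrameRef]


def interleave(steps: List[str], frames: List[List[int]]) -> List[Segment]:
    """Build one sequence in which each step follows its supporting frames.

    Placing frames before the text is an assumption; the abstract only says
    that steps are integrated with their corresponding frames.
    """
    if len(steps) != len(frames):
        raise ValueError("each reasoning step needs a (possibly empty) frame list")
    sequence: List[Segment] = []
    for text, frame_ids in zip(steps, frames):
        sequence.extend(FrameRef(i) for i in frame_ids)
        sequence.append(TextStep(text))
    return sequence


def flatten(sequence: List[Segment]) -> str:
    """Serialize the sequence, using an assumed placeholder token per frame."""
    parts = []
    for seg in sequence:
        if isinstance(seg, FrameRef):
            parts.append(f"<frame:{seg.frame_index}>")
        else:
            parts.append(seg.text)
    return " ".join(parts)
```

For example, `interleave(["The cup is on the table.", "The cup falls."], [[3], [10, 11]])` yields a five-element sequence in which frame 3 precedes the first step and frames 10 and 11 precede the second. In a real model the placeholders would be replaced by visual tokens for the referenced frames.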

Problem

Research questions and friction points this paper is trying to address.

video reasoning

Chain-of-Thought

visual-textual interleaving

multimodal reasoning

temporal understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual-Textual Interleaved CoT

multimodal reasoning

video understanding