ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models

📅 2025-07-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video understanding methods rely predominantly on text-only chain-of-thought (CoT) reasoning, leaving the visual modality unused during the reasoning process itself, at odds with human "see-while-thinking" cognition. Method: We propose Video-Text Interleaved Chain-of-Thought (ViTCoT), a framework that alternates between visual and linguistic modalities within the reasoning chain: at each step, the model re-examines key-frame visual features or generates a textual reasoning step, forming a closed-loop multimodal reasoning flow. To support this, we construct ViTIB, a Video-Text Interleaved Benchmark built with MLLM-assisted key-video selection and manual verification, and explore the ViTCoT paradigm across video understanding tasks. Contribution/Results: ViTCoT significantly outperforms text-only CoT baselines on multiple complex video reasoning tasks, and neuron-activation analysis shows that it elicits broader, more discriminative activations in multimodal large language models (MLLMs), suggesting a more human-like multimodal reasoning paradigm.
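The interleaved reasoning loop the summary describes can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' released code: the `mllm` client and its `select_key_frames`/`generate` methods are hypothetical stand-ins for a multimodal LLM interface, and the stopping condition is an assumption.

```python
# Minimal sketch of video-text interleaved CoT reasoning, assuming a
# hypothetical `mllm` client and a precomputed list of video frames.
# All names below are illustrative, not the ViTCoT reference implementation.

from dataclasses import dataclass, field


@dataclass
class InterleavedTrace:
    """Accumulates the alternating visual/textual reasoning chain."""
    steps: list = field(default_factory=list)

    def add_frames(self, frame_ids: list) -> None:
        self.steps.append(("frames", frame_ids))

    def add_text(self, thought: str) -> None:
        self.steps.append(("text", thought))


def vitcot_answer(mllm, question: str, frames: list, max_steps: int = 4) -> str:
    """Alternate between re-examining key frames and generating a textual
    reasoning step, conditioning each step on the interleaved trace so far."""
    trace = InterleavedTrace()
    for _ in range(max_steps):
        # 1) Key-frame selection: ask the model which frames matter for the
        #    current state of the reasoning chain (hypothetical API).
        frame_ids = mllm.select_key_frames(question, frames, trace.steps)
        trace.add_frames(frame_ids)
        # 2) Textual step grounded in those frames (hypothetical API).
        thought = mllm.generate(question, [frames[i] for i in frame_ids], trace.steps)
        trace.add_text(thought)
        if "final answer:" in thought.lower():
            break
    return trace.steps[-1][1]
```

The key design point, relative to text-only CoT, is that visual evidence is re-queried inside the loop rather than encoded once up front.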

📝 Abstract
Video understanding plays a vital role in bridging low-level visual signals with high-level cognitive reasoning, and is fundamental to applications such as autonomous driving, embodied AI, and the broader pursuit of AGI. The rapid development of large language models (LLMs), particularly those utilizing Chain-of-Thought (CoT) technology, has significantly advanced video reasoning capabilities. However, current approaches primarily depend on textual information for reasoning, overlooking the visual modality in the actual video reasoning process. In contrast, humans naturally re-examine visual content while reasoning. Motivated by this, we introduce a novel video reasoning paradigm: Video-Text Interleaved CoT (ViTCoT), which facilitates more intuitive and cognitively aligned reasoning. To this end, we first construct the Video-Text Interleaved Benchmark (ViTIB), created by using MLLMs for key-video selection followed by manual verification. Furthermore, we extensively explore the potential of the ViTCoT paradigm in the video understanding field. Extensive experiments demonstrate that ViTCoT significantly enhances performance compared to the traditional text-only CoT paradigm and effectively activates more neurons in MLLMs.
Problem

Research questions and friction points this paper is trying to address.

Enhancing video understanding by integrating visual and textual reasoning
Addressing the limitation of text-only reasoning in current video analysis
Developing a benchmark for evaluating video-text interleaved reasoning models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video-Text Interleaved CoT for cognitive reasoning
Constructed Video-Text Interleaved Benchmark (ViTIB)
Enhanced performance over text-only CoT paradigm
Yongheng Zhang
M.S. Student @ CSU | Research Intern @ Tencent
Artificial Intelligence · Large Language Model · World Model
Xu Liu
School of Computer Science and Engineering, Central South University, Changsha, Hunan, China
Ruihan Tao
School of Computer Science and Engineering, Central South University, Changsha, Hunan, China
Qiguang Chen
Harbin Institute of Technology
Chain-of-Thought · Reasoning · Multilingual LLM · Multi-modal LLM
Hao Fei
National University of Singapore
Vision and Language · Large Language Model · Natural Language Processing · World Modeling
Wanxiang Che
Professor at Harbin Institute of Technology
Natural Language Processing
Libo Qin
School of Computer Science and Engineering, Central South University, Changsha, Hunan, China