ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models

📅 2025-07-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video understanding methods rely predominantly on text-only chain-of-thought (CoT) reasoning, leaving the visual modality unused during the reasoning process itself, at odds with human "see-while-thinking" cognition. Method: We propose Video-Text Interleaved Chain-of-Thought (ViTCoT), a framework that alternates between visual and linguistic modalities within the reasoning chain: at each step, the model re-examines key-frame visual features or generates a textual reasoning step, forming a closed-loop multimodal reasoning flow. To support this, we construct ViTIB, a Video-Text Interleaved Benchmark built with MLLM-assisted key-video selection and manual verification, and explore the ViTCoT paradigm across video understanding tasks. Contribution/Results: ViTCoT significantly outperforms text-only CoT baselines on multiple complex video reasoning tasks, and neuron-activation analysis shows that it elicits broader, more discriminative activations in multimodal large language models (MLLMs), suggesting a more human-like multimodal reasoning paradigm.
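The interleaved reasoning loop the summary describes can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' released code: the `mllm` client and its `select_key_frames`/`generate` methods are hypothetical stand-ins for a multimodal LLM interface, and the stopping condition is an assumption.

```python
# Minimal sketch of video-text interleaved CoT reasoning, assuming a
# hypothetical `mllm` client and a precomputed list of video frames.
# All names below are illustrative, not the ViTCoT reference implementation.

from dataclasses import dataclass, field


@dataclass
class InterleavedTrace:
    """Accumulates the alternating visual/textual reasoning chain."""
    steps: list = field(default_factory=list)

    def add_frames(self, frame_ids: list) -> None:
        self.steps.append(("frames", frame_ids))

    def add_text(self, thought: str) -> None:
        self.steps.append(("text", thought))


def vitcot_answer(mllm, question: str, frames: list, max_steps: int = 4) -> str:
    """Alternate between re-examining key frames and generating a textual
    reasoning step, conditioning each step on the interleaved trace so far."""
    trace = InterleavedTrace()
    for _ in range(max_steps):
        # 1) Key-frame selection: ask the model which frames matter for the
        #    current state of the reasoning chain (hypothetical API).
        frame_ids = mllm.select_key_frames(question, frames, trace.steps)
        trace.add_frames(frame_ids)
        # 2) Textual step grounded in those frames (hypothetical API).
        thought = mllm.generate(question, [frames[i] for i in frame_ids], trace.steps)
        trace.add_text(thought)
        if "final answer:" in thought.lower():
            break
    return trace.steps[-1][1]
```

The key design point, relative to text-only CoT, is that visual evidence is re-queried inside the loop rather than encoded once up front.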

📝 Abstract
Video understanding plays a vital role in bridging low-level visual signals with high-level cognitive reasoning, and is fundamental to applications such as autonomous driving, embodied AI, and the broader pursuit of AGI. The rapid development of large language models (LLMs), particularly those utilizing Chain-of-Thought (CoT) technology, has significantly advanced video reasoning capabilities. However, current approaches primarily depend on textual information for reasoning, overlooking the visual modality in the actual video reasoning process. In contrast, humans naturally re-examine visual content while reasoning. Motivated by this, we introduce a novel video reasoning paradigm: Video-Text Interleaved CoT (ViTCoT), which facilitates more intuitive and cognitively aligned reasoning. To this end, we first construct the Video-Text Interleaved Benchmark (ViTIB), created by using MLLMs for key-video selection followed by manual verification. Furthermore, we extensively explore the potential of the ViTCoT paradigm in the video understanding field. Extensive experiments demonstrate that ViTCoT significantly enhances performance compared to the traditional text-only CoT paradigm and effectively activates more neurons in MLLMs.
Problem

Research questions and friction points this paper is trying to address.

Enhancing video understanding by integrating visual and textual reasoning
Addressing the limitation of text-only reasoning in current video analysis
Developing a benchmark for evaluating video-text interleaved reasoning models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video-Text Interleaved CoT for cognitive reasoning
Constructed Video-Text Interleaved Benchmark (ViTIB)
Enhanced performance over text-only CoT paradigm
Yongheng Zhang
M.S. Student @ CSU | Research Intern @ Tencent
Artificial Intelligence · Large Language Model · World Model
Xu Liu
School of Computer Science and Engineering, Central South University, Changsha, Hunan, China
Ruihan Tao
School of Computer Science and Engineering, Central South University, Changsha, Hunan, China
Qiguang Chen
Harbin Institute of Technology
Chain-of-Thought · Reasoning · Multilingual LLM · Multi-modal LLM
Hao Fei
National University of Singapore
Vision and Language · Large Language Model · Natural Language Processing · World Modeling
Wanxiang Che
Professor at Harbin Institute of Technology
Natural Language Processing
Libo Qin
School of Computer Science and Engineering, Central South University, Changsha, Hunan, China