TVI-CoT: Text-Visual Interleaved Chain-of-Thought Reasoning for Multimodal Understanding

📅 2026-06-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing vision-language models: once the image has been encoded, chain-of-thought reasoning proceeds entirely in text and cannot revisit visual features, which constrains fine-grained understanding and error correction. To overcome this, the authors propose a text-visual interleaved chain-of-thought framework that coordinates textual reasoning with conditional attention to image regions through three learnable control tokens: <THINK>, <LOOK>, and <ANSWER>. The model can thereby alternate explicitly between perception and reasoning during multimodal inference, breaking away from the conventional “vision-blind” paradigm. On eight benchmarks, the method reports state-of-the-art results among MLLM-based CoT methods, with gains over the baseline of +6.1% on MMMU, +3.8% on MathVerse, +3.4% on MathVista, and +3.4% on ScienceQA.
📝 Abstract
Chain-of-thought (CoT) reasoning has proven effective for enhancing problem-solving in large language models. However, when applied to multimodal LLMs (MLLMs), existing CoT approaches suffer from a fundamental limitation: they perform reasoning entirely in text without accessing visual features during the reasoning process. After initial visual encoding, image information becomes inaccessible, forcing models to reason based solely on whatever was captured in the initial description. This forms a "vision-blind reasoning" paradigm that limits fine-grained visual extraction, error verification, and adaptive attention. We propose Text-Visual Interleaved Chain-of-Thought (TVI-CoT), a framework that enables explicit interleaving of textual reasoning and visual feature access through the learnable control tokens <THINK>, <LOOK>, and <ANSWER>. These tokens allow dynamic switching between reasoning and visual grounding, attending to relevant image regions conditioned on the evolving reasoning state. Experiments on eight benchmarks demonstrate state-of-the-art results among MLLM-based CoT methods and notable performance gains over the baseline: +6.1% on MMMU, +3.8% on MathVerse, +3.4% on MathVista, and +3.4% on ScienceQA. Code is available at https://github.com/hulianyuyy/TVI-CoT.
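The interleaving idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`look`, `run_trace`), the scaled dot-product attention, and the rule that averages the attended visual context into the reasoning state are all assumptions made for illustration. Only the three control tokens come from the abstract.

```python
import math

# Control tokens named in the abstract; everything else here is hypothetical.
THINK, LOOK, ANSWER = "<THINK>", "<LOOK>", "<ANSWER>"


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def look(state, regions):
    """Attend over image-region features, conditioned on the reasoning state.

    Assumption: scaled dot-product attention with the state as the query.
    Returns the attended context vector and the attention weights.
    """
    scale = math.sqrt(len(state))
    scores = [sum(q * k for q, k in zip(state, r)) / scale for r in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    context = [sum(w * r[d] for w, r in zip(weights, regions)) for d in range(dim)]
    return context, weights


def run_trace(tokens, regions, state):
    """Walk a token sequence, switching mode whenever a control token appears."""
    mode = None
    log = []
    for tok in tokens:
        if tok == THINK:
            mode = "think"
        elif tok == LOOK:
            mode = "look"
            context, weights = look(state, regions)
            # Assumption: fold the visual context back in by simple averaging.
            state = [(s + c) / 2 for s, c in zip(state, context)]
            # Record which region received the most attention.
            log.append(("look", max(range(len(weights)), key=weights.__getitem__)))
        elif tok == ANSWER:
            mode = "answer"
        else:
            # Ordinary token: tag it with the current mode.
            log.append((mode, tok))
    return log, state


# Two toy region features and a reasoning state that points at region 1.
regions = [[1.0, 0.0], [0.0, 1.0]]
tokens = [THINK, "count", LOOK, THINK, "three", ANSWER, "3"]
log, final_state = run_trace(tokens, regions, [0.0, 2.0])
```

In this sketch, each <LOOK> re-queries the region features with whatever the reasoning state is at that moment, which is the contrast with a pipeline that encodes the image once and never returns to it.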
Problem

Research questions and friction points this paper is trying to address.

Chain-of-Thought
Multimodal LLMs
Visual Reasoning
Vision-Blind Reasoning
Visual Grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought
Multimodal Reasoning
Visual Grounding
Interleaved Reasoning
Control Tokens