TVI-CoT: Text-Visual Interleaved Chain-of-Thought Reasoning for Multimodal Understanding

📅 2026-06-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a critical limitation in existing vision-language models: after initial encoding, they cannot revisit visual features during chain-of-thought reasoning, which constrains fine-grained understanding and error correction. To overcome this, the authors propose a text-visual interleaved chain-of-thought framework that coordinates textual reasoning with conditional attention to image regions through three learnable control tokens: <THINK>, <LOOK>, and <ANSWER>. This approach enables explicit alternation between perception and reasoning within multimodal inference, breaking away from the conventional “vision-blind” paradigm. On eight benchmarks, the method reports state-of-the-art results among MLLM-based CoT methods, with gains over the baseline of +6.1% on MMMU, +3.8% on MathVerse, +3.4% on MathVista, and +3.4% on ScienceQA.

📝 Abstract

Chain-of-thought (CoT) reasoning has proven effective for enhancing problem-solving in large language models. However, when applied to multimodal LLMs (MLLMs), existing CoT approaches suffer from a fundamental limitation: they perform reasoning entirely in text without accessing visual features during the reasoning process. After initial visual encoding, image information becomes inaccessible, forcing models to reason based solely on whatever was captured in the initial description. This forms a "vision-blind reasoning" paradigm that limits fine-grained visual extraction, error verification, and adaptive attention. We propose Text-Visual Interleaved Chain-of-Thought (TVI-CoT), a framework that enables explicit interleaving of textual reasoning and visual feature access through learnable control tokens <THINK>, <LOOK>, and <ANSWER>. These tokens allow dynamic switching between reasoning and visual grounding, attending to relevant image regions conditioned on the evolving reasoning state. Experiments on eight benchmarks demonstrate state-of-the-art results among MLLM-based CoT methods and a notable performance boost over the baseline: +6.1% on MMMU, +3.8% on MathVerse, +3.4% on MathVista, and +3.4% on ScienceQA. Code is available at https://github.com/hulianyuyy/TVI-CoT.
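
The interleaving described in the abstract can be pictured with a toy control loop. The sketch below is not the authors' implementation (see the linked repository for that). It assumes that <LOOK> triggers a scaled dot-product attention over image-region features with the current reasoning state as the query, and that the attended features are mixed back into the state with a fixed 0.5 weight. The function names `look` and `run_interleaved`, the mixing rule, and the list-based vectors are all hypothetical and chosen for illustration only.

```python
import math

# Control tokens named in the abstract.
THINK, LOOK, ANSWER = "<THINK>", "<LOOK>", "<ANSWER>"


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def look(state, region_features):
    """Attend over image regions using the reasoning state as the query.

    Assumption: plain scaled dot-product attention; the paper may differ.
    Returns the attention weights and the weighted sum of region features.
    """
    scale = math.sqrt(len(state))
    scores = [
        sum(q * k for q, k in zip(state, region)) / scale
        for region in region_features
    ]
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [
        sum(w * region[d] for w, region in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context


def run_interleaved(control_tokens, state, region_features):
    """Walk a sequence of control tokens, revisiting the image on <LOOK>.

    Assumption: the state update is a fixed 50/50 mix of old state and
    attended visual context. In a real model, <THINK> would generate text
    and update the state too; here it is only recorded in the trace.
    """
    trace = []
    for token in control_tokens:
        if token == THINK:
            trace.append((THINK, None))
        elif token == LOOK:
            weights, context = look(state, region_features)
            state = [0.5 * s + 0.5 * c for s, c in zip(state, context)]
            trace.append((LOOK, weights))
        elif token == ANSWER:
            trace.append((ANSWER, None))
            break  # reasoning ends once the answer token is emitted
        else:
            raise ValueError(f"unknown control token: {token}")
    return state, trace


if __name__ == "__main__":
    regions = [[1.0, 0.0], [0.0, 1.0]]  # two toy image-region features
    final_state, steps = run_interleaved(
        [THINK, LOOK, THINK, ANSWER], [2.0, 0.0], regions
    )
    for name, weights in steps:
        print(name, weights)
```

In this toy run the state is aligned with the first region, so the <LOOK> step places more attention weight on it. The point of the sketch is only that visual features stay reachable in the middle of the reasoning chain, rather than being consumed once before reasoning starts.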

Problem

Research questions and friction points this paper is trying to address.

Chain-of-Thought

Multimodal LLMs

Visual Reasoning

Vision-Blind Reasoning

Visual Grounding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought

Multimodal Reasoning

Visual Grounding