Reliable Thinking with Images

📅 2026-02-13
📈 Citations: 0
Influential: 0

📝 Abstract
As a multimodal extension of Chain-of-Thought (CoT), Thinking with Images (TWI) has recently emerged as a promising way to enhance the reasoning capability of Multi-modal Large Language Models (MLLMs): it generates interleaved CoT by incorporating visual cues into the textual reasoning process. However, the success of existing TWI methods relies heavily on the assumption that interleaved image-text CoTs are faultless, an assumption that is easily violated in real-world scenarios due to the complexity of multimodal understanding. In this paper, we reveal and study a highly practical yet under-explored problem in TWI, termed Noisy Thinking (NT). Specifically, NT refers to imperfections in the visual cue mining and answer reasoning processes. As the saying goes, "one mistake leads to another": an erroneous interleaved CoT causes errors to accumulate, significantly degrading the performance of MLLMs. To solve the NT problem, we propose a novel method dubbed Reliable Thinking with Images (RTWI). In brief, RTWI estimates the reliability of visual cues and textual CoT in a unified text-centric manner and accordingly employs robust filtering and voting modules to prevent NT from contaminating the final answer. Extensive experiments on seven benchmarks verify the effectiveness of RTWI against NT.
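
The abstract describes filtering and voting only at a high level, so the following is a minimal sketch of the general idea rather than the RTWI implementation. It assumes that several reasoning chains have been sampled and that each already carries two reliability scores in [0, 1]. The `Chain` structure, the `reliable_vote` function, the product used to combine the scores, and the `threshold` value are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Chain:
    """One sampled interleaved reasoning chain (hypothetical structure)."""
    answer: str
    cue_reliability: float   # assumed score in [0, 1] for the visual cues
    text_reliability: float  # assumed score in [0, 1] for the textual CoT


def reliable_vote(chains, threshold=0.5):
    """Filter low-reliability chains, then take a reliability-weighted vote."""
    if not chains:
        raise ValueError("at least one chain is required")
    # Combine the two scores; the product is an illustrative choice only.
    scored = [(c.answer, c.cue_reliability * c.text_reliability) for c in chains]
    # Filtering: drop chains whose combined score falls below the threshold.
    kept = [(answer, score) for answer, score in scored if score >= threshold]
    # Fall back to every chain if filtering removed all of them.
    if not kept:
        kept = scored
    # Voting: accumulate reliability mass per candidate answer.
    totals = defaultdict(float)
    for answer, score in kept:
        totals[answer] += score
    return max(totals, key=totals.get)


chains = [
    Chain("A", 0.9, 0.9),
    Chain("B", 0.6, 0.5),
    Chain("B", 0.5, 0.6),
    Chain("B", 0.4, 0.4),
]
print(reliable_vote(chains))
```

In this toy input a plain majority vote would pick "B" (three chains out of four), while the reliability-aware vote keeps only the single high-scoring chain and returns "A". This illustrates how unreliable chains can be kept from contaminating the final answer.
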
Problem

Research questions and friction points this paper is trying to address.

Noisy Thinking
Thinking with Images
Multimodal Large Language Models
Error Accumulation
Visual Cues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Noisy Thinking
Reliable Thinking with Images
Multimodal Reasoning
Chain-of-Thought
Error Accumulation
Haobin Li
College of Computer Science, Sichuan University
Yutong Yang
Mercedes-Benz AG R&D & University of Stuttgart
Computer Vision, Autonomous Driving
Yijie Lin
Sichuan University
Multi-modal, Multi-view
Xiang Dai
CSIRO
Natural Language Processing, Digital Health
Mouxing Yang
Sichuan University
Multi-modal, Multi-view, Noisy Correspondence
Xi Peng
College of Computer Science, Sichuan University; National Key Laboratory of Fundamental Algorithms and Models for Engineering Simulation, Sichuan University