Reliable Thinking with Images

📅 2026-02-13

📝 Abstract

As a multimodal extension of Chain-of-Thought (CoT), Thinking with Images (TWI) has recently emerged as a promising way to enhance the reasoning capability of Multimodal Large Language Models (MLLMs): it generates interleaved CoT by incorporating visual cues into the textual reasoning process. However, the success of existing TWI methods relies heavily on the assumption that interleaved image-text CoTs are faultless, an assumption that is easily violated in real-world scenarios due to the complexity of multimodal understanding. In this paper, we reveal and study a highly practical yet under-explored problem in TWI, termed Noisy Thinking (NT). Specifically, NT refers to imperfections in the visual cue mining and answer reasoning processes. As the saying goes, "One mistake leads to another": an erroneous interleaved CoT causes error accumulation, significantly degrading the performance of MLLMs. To solve the NT problem, we propose a novel method dubbed Reliable Thinking with Images (RTWI). In brief, RTWI estimates the reliability of visual cues and textual CoT in a unified text-centric manner and accordingly employs robust filtering and voting modules to prevent NT from contaminating the final answer. Extensive experiments on seven benchmarks verify the effectiveness of RTWI against NT.
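
The abstract names filtering and voting driven by reliability estimates but gives no details. The sketch below is an illustration only, not the paper's RTWI implementation. It assumes each sampled reasoning path already carries a reliability score in [0, 1]; the `ReasoningPath` type, the `filter_and_vote` function, and the `threshold` parameter are hypothetical names introduced here. It shows one plausible way that dropping low-reliability paths and weighting the remaining votes could keep noisy paths from deciding the final answer.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ReasoningPath:
    """One sampled interleaved chain of thought (hypothetical structure)."""
    answer: str
    reliability: float  # assumed score in [0, 1]; how it is estimated is out of scope


def filter_and_vote(paths, threshold=0.5):
    """Drop low-reliability paths, then take a reliability-weighted vote.

    Returns None when `paths` is empty.
    """
    if not paths:
        return None
    kept = [p for p in paths if p.reliability >= threshold]
    # Fall back to all paths if the filter rejects everything.
    if not kept:
        kept = list(paths)
    scores = defaultdict(float)
    for p in kept:
        scores[p.answer] += p.reliability
    return max(scores, key=scores.get)


paths = [
    ReasoningPath("B", 0.9),
    ReasoningPath("A", 0.3),
    ReasoningPath("A", 0.4),
    ReasoningPath("B", 0.6),
    ReasoningPath("C", 0.7),
]
print(filter_and_vote(paths))  # prints B
```

In this toy input, a plain majority vote would tie between "A" and "B"; after filtering out the two low-reliability "A" paths, "B" wins with a combined weight of 1.5 against 0.7 for "C".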

Problem

Research questions and friction points this paper is trying to address.

Noisy Thinking

Thinking with Images

Multimodal Large Language Models

Error Accumulation

Visual Cues

Innovation

Methods, ideas, or system contributions that make the work stand out.

Noisy Thinking

Reliable Thinking with Images

Multimodal Reasoning