Visual Abstract Thinking Empowers Multimodal Reasoning

📅 2025-05-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Images contain rich visual details but suffer from high redundancy, which impairs the visual reasoning capabilities of multimodal large language models (MLLMs). Method: We propose Visual Abstraction Thinking (VAT), the first framework to integrate human-inspired abstraction cognition into multimodal reasoning. VAT generates concise visual abstract representations—replacing verbose language-based reasoning chains or external tool calls—via three core strategies: saliency masking, topological simplification, and semantic distillation. By focusing on salient visual elements, VAT reduces information redundancy and enhances attention concentration, while remaining compatible with existing paradigms such as Chain-of-Thought (CoT). Contribution/Results: VAT achieves an average 17% improvement over the GPT-4o baseline across diverse visual reasoning tasks. It demonstrates consistent gains in concept identification, structural understanding, and relational reasoning, and synergizes effectively with knowledge-intensive CoT for further optimization.

Technology Category

Application Category

📝 Abstract
Images usually convey richer detail than text, but often include redundant information which potentially downgrades multimodal reasoning performance. When faced with lengthy or complex messages, humans tend to employ abstract thinking to convert them into simple and concise abstracts. Inspired by this cognitive strategy, we introduce Visual Abstract Thinking (VAT), a novel thinking paradigm that prompts Multimodal Large Language Models (MLLMs) with visual abstract instead of explicit verbal thoughts or elaborate guidance, permitting a more concentrated visual reasoning mechanism. Explicit thinking, such as Chain-of-thought (CoT) or tool-augmented approaches, increases the complexity of reasoning process via inserting verbose intermediate steps, external knowledge or visual information. In contrast, VAT reduces redundant visual information and encourages models to focus their reasoning on more essential visual elements. Experimental results show that VAT consistently empowers different models, and achieves an average gain of 17% over GPT-4o baseline by employing diverse types of visual abstracts, demonstrating that VAT can enhance visual reasoning abilities for MLLMs regarding conceptual, structural and relational reasoning tasks. VAT is also compatible with CoT in knowledge-intensive multimodal reasoning tasks. These findings highlight the effectiveness of visual reasoning via abstract thinking and encourage further exploration of more diverse reasoning paradigms from the perspective of human cognition.
Problem

Research questions and friction points this paper is trying to address.

Reduces redundant visual information in multimodal reasoning
Enhances focus on essential visual elements for reasoning
Improves performance in conceptual, structural, relational reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Abstract Thinking (VAT) enhances MLLMs
VAT reduces redundant visual information
VAT improves reasoning by focusing on essentials
🔎 Similar Papers
No similar papers found.
D
Dairu Liu
College of Software, Nankai University, Tianjin, China
Z
Ziyue Wang
Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China
M
Minyuan Ruan
Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China
Fuwen Luo
Fuwen Luo
Tsinghua University
Computer Science
C
Chi Chen
Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China
P
Peng Li
Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
Y
Yang Liu
Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China; Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China