Mind with Eyes: from Language Reasoning to Multimodal Reasoning

📅 2025-03-23
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Current large language models (LLMs) lack dynamic interaction and embodied cognition capabilities in multimodal reasoning. To address this, the paper systematically surveys recent advances and organizes them into a dual-paradigm framework of "language-centric" and "collaborative" multimodal reasoning. Methodologically, it integrates vision-language understanding, active visual perception, action-driven reasoning, and state modeling into a unified technical pathway that supports omnimodal input and embodied behavior generation. It also introduces a comprehensive taxonomy and benchmark task map covering perception, reasoning, action, and state updating in multimodal reasoning. Contributions include: (1) a precise delineation of the boundary between the two paradigms; (2) an evolutionary roadmap from vision-language reasoning toward fully multimodal agents; and (3) an evaluable theoretical framework and technical guidance for embodied intelligence and general multimodal cognition.

📝 Abstract
Language models have recently advanced into the realm of reasoning, yet it is through multimodal reasoning that we can fully unlock the potential to achieve more comprehensive, human-like cognitive capabilities. This survey provides a systematic overview of recent multimodal reasoning approaches, categorizing them into two levels: language-centric multimodal reasoning and collaborative multimodal reasoning. The former encompasses one-pass visual perception and active visual perception, where vision primarily serves a supporting role in language reasoning. The latter involves action generation and state updates within the reasoning process, enabling more dynamic interaction between modalities. Furthermore, we analyze the technical evolution of these methods, discuss their inherent challenges, and introduce key benchmark tasks and evaluation metrics for assessing multimodal reasoning performance. Finally, we provide insights into future research directions from two perspectives: (i) from visual-language reasoning to omnimodal reasoning, and (ii) from multimodal reasoning to multimodal agents. This survey aims to provide a structured overview that will inspire further advancements in multimodal reasoning research.
Problem

Research questions and friction points this paper is trying to address.

Advancing language models to multimodal reasoning for human-like cognition
Surveying and categorizing recent multimodal reasoning approaches
Identifying challenges and future directions in multimodal reasoning research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-centric multimodal reasoning with vision support
Collaborative multimodal reasoning with dynamic interaction
Evolution from visual-language to omnimodal reasoning
👥 Authors
Zhiyu Lin, Yifei Gao, Xian Zhao, Yunfan Yang, Jitao Sang (Beijing Jiaotong University)