How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language model (MLLM) evaluations predominantly assess final outputs, failing to capture the dynamic reasoning processes that underlie coordinated capabilities such as visual perception, spatial reasoning, and object inference. Method: This paper introduces MM-Escape, a novel benchmark, and EscapeCraft, an interactive escape-room environment, establishing the first intermediate-behavior-centered paradigm for multimodal reasoning evaluation. The framework integrates vision-language joint modeling, trajectory logging, and multi-granularity quantitative analysis—spanning actions, states, and decisions—to systematically identify fine-grained failure patterns (e.g., redundant exploration, spatial disorientation, incorrect item usage). Contribution/Results: Experiments reveal that state-of-the-art MLLMs approach human-level strategies on simple tasks but suffer sharp performance degradation as complexity increases. Crucially, distinct models exhibit heterogeneous bottlenecks, enabling interpretable, diagnostic insights for targeted reasoning-capability alignment.

📝 Abstract
The rapid advancement of Multimodal Large Language Models (MLLMs) has spurred interest in complex multimodal reasoning tasks in real-world and virtual environments, which require coordinating multiple abilities, including visual perception, visual reasoning, spatial awareness, and target deduction. However, existing evaluations primarily assess final task completion, often reducing assessment to isolated abilities such as visual grounding and visual question answering. Less attention is given to comprehensively and quantitatively analyzing the reasoning process in multimodal environments, which is crucial for understanding model behaviors and underlying reasoning mechanisms beyond mere task success. To address this, we introduce MM-Escape, an extensible benchmark for investigating multimodal reasoning, inspired by real-world escape games. MM-Escape emphasizes intermediate model behaviors alongside final task completion. To achieve this, we develop EscapeCraft, a customizable and open environment that enables models to engage in free-form exploration for assessing multimodal reasoning. Extensive experiments show that MLLMs, regardless of scale, can successfully complete the simplest room-escape tasks, with some exhibiting human-like exploration strategies. Yet performance drops dramatically as task difficulty increases. Moreover, we observe that performance bottlenecks vary across models, revealing distinct failure modes and limitations in their multimodal reasoning abilities, such as repetitive trajectories without adaptive exploration, getting stuck in corners due to poor visual spatial awareness, and ineffective use of acquired props such as the key. We hope our work sheds light on new challenges in multimodal reasoning and uncovers potential improvements in MLLM capabilities.
Problem

Research questions and friction points this paper is trying to address.

Evaluates multimodal reasoning in complex environments
Identifies limitations in visual spatial awareness
Assesses adaptive exploration and prop usage
Innovation

Methods, ideas, or system contributions that make the work stand out.

MM-Escape benchmark for multimodal reasoning analysis
EscapeCraft environment for free-form exploration assessment
Identifies MLLMs' failure modes in complex tasks
Authors
Ziyue Wang — Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China
Yurui Dong — Fudan University
Fuwen Luo — Tsinghua University
Minyuan Ruan — Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China
Zhili Cheng — Tsinghua University
Chi Chen — Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China
Peng Li — Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
Yang Liu — Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China; Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China