RoboBench: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models as Embodied Brain

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing embodied intelligence benchmarks predominantly measure execution success rates, neglecting systematic evaluation of cognitive capabilities, and suffer from low task fidelity and incomplete assessment dimensions. This paper introduces RoboBench, the first comprehensive cognitive evaluation benchmark designed specifically for multimodal large language models (MLLMs) serving as "embodied brains." It systematically assesses five core cognitive dimensions: instruction comprehension, perception reasoning, generalized planning, affordance prediction, and failure analysis. The authors propose a novel MLLM-as-world-simulator evaluation framework, which validates behavioral plausibility by simulating physical state transitions, and construct a high-fidelity, multi-view, attribute-rich robot-oriented question-answering dataset. Extensive experiments across 14 state-of-the-art MLLMs reveal critical bottlenecks in implicit instruction comprehension, spatiotemporal reasoning, cross-scenario planning, fine-grained affordance modeling, and failure diagnosis.

📝 Abstract
Building robots that can perceive, reason, and act in dynamic, unstructured environments remains a core challenge. Recent embodied systems often adopt a dual-system paradigm, where System 2 handles high-level reasoning while System 1 executes low-level control. In this work, we refer to System 2 as the embodied brain, emphasizing its role as the cognitive core for reasoning and decision-making in manipulation tasks. Given this role, systematic evaluation of the embodied brain is essential. Yet existing benchmarks emphasize execution success or, when targeting high-level reasoning, suffer from incomplete dimensions and limited task realism, offering only a partial picture of cognitive capability. To bridge this gap, we introduce RoboBench, a benchmark that systematically evaluates multimodal large language models (MLLMs) as embodied brains. Motivated by the embodied brain's critical roles across the full manipulation pipeline, RoboBench defines five dimensions: instruction comprehension, perception reasoning, generalized planning, affordance prediction, and failure analysis, spanning 14 capabilities, 25 tasks, and 6,092 QA pairs. To ensure realism, we curate datasets across diverse embodiments, attribute-rich objects, and multi-view scenes, drawing from large-scale real robotic data. For planning, RoboBench introduces an evaluation framework, MLLM-as-world-simulator, which evaluates embodied feasibility by simulating whether predicted plans can achieve critical object-state changes. Experiments on 14 MLLMs reveal fundamental limitations: difficulties with implicit instruction comprehension, spatiotemporal reasoning, cross-scenario planning, fine-grained affordance understanding, and execution failure diagnosis. RoboBench provides a comprehensive scaffold to quantify high-level cognition and guide the development of next-generation embodied MLLMs. The project page is at https://robo-bench.github.io.
Problem

Research questions and friction points this paper is trying to address.

Evaluates multimodal LLMs as embodied brains for robot cognition
Addresses incomplete assessment of high-level reasoning in manipulation tasks
Systematically tests instruction comprehension and planning across diverse scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

RoboBench benchmark evaluates MLLMs as embodied brains
Framework assesses five cognitive dimensions across 14 capabilities
Uses MLLM-as-world-simulator to verify plan feasibility
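The MLLM-as-world-simulator idea above can be illustrated with a minimal sketch: rather than string-matching a predicted plan against a reference plan, roll the plan forward through a state-transition model and check whether the critical object-state changes are achieved. The paper uses an MLLM as the simulator; here a toy rule-based transition function stands in for it, and all object names, steps, and rules are hypothetical illustrations, not taken from RoboBench.

```python
# Hedged sketch of feasibility checking via simulated state transitions.
# simulate_step is a toy stand-in for querying an MLLM as a world simulator;
# the objects, steps, and rules below are illustrative assumptions.

def simulate_step(state: dict, step: str) -> dict:
    """Apply one plan step to a symbolic object state (toy rule table)."""
    new_state = dict(state)
    if step == "open drawer":
        new_state["drawer"] = "open"
    elif step == "pick up cup":
        # The cup is inside the drawer, so it must be open first.
        if new_state.get("drawer") == "open":
            new_state["cup"] = "held"
    elif step == "place cup on table":
        if new_state.get("cup") == "held":
            new_state["cup"] = "on_table"
    return new_state

def plan_is_feasible(plan: list, init_state: dict, goal_state: dict) -> bool:
    """Roll the plan forward and test whether all critical state changes hold."""
    state = dict(init_state)
    for step in plan:
        state = simulate_step(state, step)
    return all(state.get(obj) == val for obj, val in goal_state.items())

init = {"drawer": "closed", "cup": "in_drawer"}
goal = {"cup": "on_table"}

good_plan = ["open drawer", "pick up cup", "place cup on table"]
bad_plan = ["pick up cup", "place cup on table"]  # skips opening the drawer

print(plan_is_feasible(good_plan, init, goal))  # True
print(plan_is_feasible(bad_plan, init, goal))   # False: precondition violated
```

The point of this evaluation style is that a plan with correct-sounding steps in the wrong order fails the check, because the simulated preconditions are never satisfied.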
Authors

Yulin Luo (Peking University)
Chun-Kai Fan (Peking University)
Menghang Dong (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Jiayu Shi (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Mengdi Zhao (Institute for Brain and Intelligence, Fudan University)
Bo-Wen Zhang (University of Science and Technology Beijing)
Cheng Chi (Columbia University, Stanford University)
Jiaming Liu (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Gaole Dai (PhD Candidate, Peking University)
Rongyu Zhang (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Ruichuan An (Xi'an Jiaotong University, Peking University)
Kun Wu (Beijing Innovation Center of Humanoid Robotics)
Zhengping Che (X-Humanoid)
Shaoxuan Xie (Beijing Academy of Artificial Intelligence)
Guocai Yao (Beijing Academy of Artificial Intelligence)
Zhongxia Zhao (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Pengwei Wang (University of Calgary)
Guang Liu (BAAI)
Zhongyuan Wang (Beijing Academy of Artificial Intelligence)
Tiejun Huang (Professor, School of Computer Science, Peking University)
Shanghang Zhang (Peking University)