EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents

📅 2025-02-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
The absence of systematic evaluation benchmarks tailored for vision-driven embodied agents hinders the assessment and advancement of multimodal large language models (MLLMs) in realistic embodied settings. To address this, we introduce EmbodiedBench, a comprehensive multimodal benchmark for the embodied evaluation of MLLMs. It comprises 1,128 tasks spanning four environments built on simulation platforms such as AI2-THOR and Habitat, with fine-grained subsets probing six core embodied capabilities (e.g., spatial awareness and long-term planning). EmbodiedBench moves beyond traditional language-centric metrics toward a multi-granularity, vision-driven evaluation paradigm. An empirical study of 13 state-of-the-art MLLMs reveals critical limitations: the best model, GPT-4o, achieves only 28.9% average success on low-level manipulation tasks. The benchmark code, task data, and an automated evaluation pipeline are publicly released.

📝 Abstract
Leveraging Multi-modal Large Language Models (MLLMs) to create embodied agents offers a promising avenue for tackling real-world tasks. While language-centric embodied agents have garnered substantial attention, MLLM-based embodied agents remain underexplored due to the lack of comprehensive evaluation frameworks. To bridge this gap, we introduce EmbodiedBench, an extensive benchmark designed to evaluate vision-driven embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing tasks across four environments, ranging from high-level semantic tasks (e.g., household) to low-level tasks involving atomic actions (e.g., navigation and manipulation); and (2) six meticulously curated subsets evaluating essential agent capabilities like commonsense reasoning, complex instruction understanding, spatial awareness, visual perception, and long-term planning. Through extensive experiments, we evaluated 13 leading proprietary and open-source MLLMs within EmbodiedBench. Our findings reveal that MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model, GPT-4o, scoring only 28.9% on average. EmbodiedBench provides a multifaceted standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance MLLM-based embodied agents. Our code is available at https://embodiedbench.github.io.
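
To make the setup described in the abstract concrete, below is a minimal, hypothetical sketch of an episodic evaluation loop in the style the benchmark describes: an agent receives visual observations and a language instruction, acts until the episode ends, and success rates are averaged per capability subset. All names here (Task, DummyEnv, DummyAgent, evaluate) are illustrative assumptions, not the actual EmbodiedBench API; see https://embodiedbench.github.io for the real code.

```python
# Hypothetical sketch of an EmbodiedBench-style evaluation loop.
# None of these names come from the actual EmbodiedBench codebase.
import random
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Task:
    instruction: str  # natural-language goal, e.g. "put the mug in the sink"
    subset: str       # capability subset, e.g. "spatial awareness"

class DummyEnv:
    """Stand-in for a simulated environment; real ones render visual frames."""
    def reset(self):
        return "first-person RGB observation"

    def step(self, action):
        done = random.random() < 0.2            # episode ends stochastically here
        success = done and random.random() < 0.3
        return "next observation", done, success

class DummyAgent:
    """Stand-in for an MLLM agent mapping (observation, instruction) to an action."""
    def act(self, obs, instruction):
        return "move_forward"                   # a real agent would query an MLLM

def evaluate(agent, env, tasks, max_steps=30):
    """Run each task as an episode; report average success per capability subset."""
    per_subset = defaultdict(list)
    for task in tasks:
        obs, success = env.reset(), False
        for _ in range(max_steps):
            obs, done, success = env.step(agent.act(obs, task.instruction))
            if done:
                break
        per_subset[task.subset].append(float(success))
    return {s: sum(v) / len(v) for s, v in per_subset.items()}

tasks = [
    Task("put the mug in the sink", "long-term planning"),
    Task("pick up the red block", "visual perception"),
]
print(evaluate(DummyAgent(), DummyEnv(), tasks))
```

Per-subset averaging is what allows the benchmark to localize failures (e.g., low-level manipulation) rather than reporting a single aggregate score.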
Problem

Research questions and friction points this paper is trying to address.

How can vision-driven embodied agents built on MLLMs be evaluated systematically?
How well do MLLM agents handle diverse tasks across multiple embodied environments?
Where do MLLM agents break down on low-level manipulation tasks?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal Large Language Models
Vision-Driven Embodied Agents
Comprehensive Evaluation Framework
Authors
Rui Yang
University of Illinois Urbana-Champaign
Hanyang Chen
University of Illinois Urbana-Champaign
Junyu Zhang
University of Illinois Urbana-Champaign
Mark Zhao
University of Colorado Boulder
Computer Systems · Systems for ML · Cloud Computing
Cheng Qian
University of Illinois Urbana-Champaign
Kangrui Wang
Northwestern University
Artificial Intelligence
Qineng Wang
Northwestern University
Foundation Models · Embodied Agents · Spatial Intelligence · Reasoning
Teja Venkat Koripella
University of Illinois Urbana-Champaign
Marziyeh Movahedi
Toyota Technological Institute at Chicago
Manling Li
Assistant Professor at Northwestern University
Natural Language Processing · Vision-Language · Embodied Agents
Heng Ji
Professor of Computer Science, AICE Director, ASKS Director, UIUC, Amazon Scholar
Natural Language Processing · Large Language Models
Huan Zhang
University of Illinois Urbana-Champaign
Tong Zhang
University of Illinois Urbana-Champaign