🤖 AI Summary
Current multimodal large language models (MLLMs) excel at high-level planning for embodied intelligence but exhibit severe deficiencies in fine-grained action understanding. To address this gap, we propose CFG-Bench, the first systematic benchmark for evaluating embodied agents' fine-grained action cognition across four dimensions: physical interaction, temporal causality, intent comprehension, and evaluative judgment. Built on 1,368 videos and 19,562 tri-modal question-answer pairs, CFG-Bench establishes a multimodal evaluation framework for fine-grained action understanding. Our empirical analysis reveals significant shortcomings of state-of-the-art MLLMs in generating physically grounded instructions and in higher-order action reasoning. Furthermore, we demonstrate that supervised fine-tuning on our data substantially improves model performance on real-world embodied tasks. This work advances the translation of visual perception into executable action knowledge and introduces a novel paradigm for action-cognitive modeling in embodied intelligence.
📝 Abstract
Multimodal Large Language Models (MLLMs) show promising results as decision-making engines for embodied agents operating in complex, physical environments. However, existing benchmarks often prioritize high-level planning or spatial reasoning, leaving the fine-grained action intelligence required for embodied physical interaction underexplored. To address this gap, we introduce CFG-Bench, a new benchmark designed to systematically evaluate this crucial capability. CFG-Bench consists of 1,368 curated videos paired with 19,562 three-modality question-answer pairs targeting four cognitive abilities: 1) Physical Interaction, 2) Temporal-Causal Relation, 3) Intentional Understanding, and 4) Evaluative Judgment. Together, these dimensions provide a systematic framework for assessing a model's ability to translate visual observations into actionable knowledge, moving beyond surface-level recognition. Our comprehensive evaluation on CFG-Bench reveals that leading MLLMs struggle to produce detailed instructions for physical interactions and exhibit profound limitations in the higher-order reasoning required for intention and evaluation. Moreover, supervised fine-tuning (SFT) on our data demonstrates that teaching an MLLM to articulate fine-grained actions translates directly into significant performance gains on established embodied benchmarks. Our analysis highlights these limitations and offers insights for developing more capable and grounded embodied agents.