Beyond Description: Cognitively Benchmarking Fine-Grained Action for Embodied Agents

📅 2025-11-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) excel at high-level planning in embodied intelligence but exhibit severe deficiencies in fine-grained action understanding. To address this gap, we propose CFG-Bench—the first systematic benchmark for evaluating embodied agents’ fine-grained action cognition, spanning four dimensions: physical interaction, temporal causality, intent comprehension, and evaluative judgment. Built upon 1,368 videos and 19,562 tri-modal question-answer pairs, it establishes the first multimodal fine-grained action evaluation framework. Our empirical analysis reveals significant shortcomings of state-of-the-art MLLMs in generating physically grounded instructions and performing higher-order action reasoning. Furthermore, we demonstrate that supervised fine-tuning substantially improves model performance on real-world embodied tasks. This work advances the translation of visual perception into executable action knowledge and introduces a novel paradigm for action-cognitive modeling in embodied intelligence.

📝 Abstract
Multimodal Large Language Models (MLLMs) show promising results as decision-making engines for embodied agents operating in complex, physical environments. However, existing benchmarks often prioritize high-level planning or spatial reasoning, leaving the fine-grained action intelligence required for embodied physical interaction underexplored. To address this gap, we introduce CFG-Bench, a new benchmark designed to systematically evaluate this crucial capability. CFG-Bench consists of 1,368 curated videos paired with 19,562 three-modality question-answer pairs targeting four cognitive abilities: 1) Physical Interaction, 2) Temporal-Causal Relation, 3) Intentional Understanding, and 4) Evaluative Judgment. Together, these dimensions provide a systematic framework for assessing a model's ability to translate visual observations into actionable knowledge, moving beyond mere surface-level recognition. Our comprehensive evaluation on CFG-Bench reveals that leading MLLMs struggle to produce detailed instructions for physical interactions and exhibit profound limitations in the higher-order reasoning of intention and evaluation. Moreover, supervised fine-tuning (SFT) on our data demonstrates that teaching an MLLM to articulate fine-grained actions directly translates to significant performance gains on established embodied benchmarks. Our analysis highlights these limitations and offers insights for developing more capable and grounded embodied agents.
Problem

Research questions and friction points this paper is trying to address.

Benchmarking fine-grained action intelligence for embodied agents
Evaluating multimodal models' physical interaction and reasoning capabilities
Addressing limitations in translating visual observations to actionable knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

A benchmark evaluating fine-grained physical interaction intelligence across four cognitive dimensions
Supervised fine-tuning on benchmark data improves embodied agent performance
A framework for translating visual observations into executable action knowledge
Dayong Liu
Zhejiang University
Chao Xu
Wolf 1069B, Sany Group
Weihong Chen
Wolf 1069B, Sany Group
Suyu Zhang
Wolf 1069B, Sany Group
Juncheng Wang
The Hong Kong Polytechnic University
Jiankang Deng
Imperial College London
Computer Vision, Machine Learning
Baigui Sun
Wolf 1069B Lab, Sany Group
Artificial Intelligence, Computer Vision
Yang Liu
Wolf 1069B, Sany Group