🤖 AI Summary
Current multimodal large language models (MLLMs) excel at high-level planning for embodied intelligence but exhibit severe deficiencies in fine-grained action understanding. To address this gap, we propose CFG-Bench, the first systematic benchmark for evaluating embodied agents' fine-grained action cognition across four dimensions: physical interaction, temporal causality, intent comprehension, and evaluative judgment. Built on 1,368 videos and 19,562 tri-modal question-answer pairs, CFG-Bench establishes a multimodal evaluation framework for fine-grained action understanding. Our empirical analysis reveals significant shortcomings of state-of-the-art MLLMs in generating physically grounded instructions and in higher-order action reasoning. Furthermore, we demonstrate that supervised fine-tuning on our data substantially improves model performance on real-world embodied tasks. This work advances the translation of visual perception into executable action knowledge and introduces a novel paradigm for action-cognitive modeling in embodied intelligence.
📝 Abstract
Multimodal Large Language Models (MLLMs) show promising results as decision-making engines for embodied agents operating in complex, physical environments. However, existing benchmarks often prioritize high-level planning or spatial reasoning, leaving the fine-grained action intelligence required for embodied physical interaction underexplored. To address this gap, we introduce CFG-Bench, a new benchmark designed to systematically evaluate this crucial capability. CFG-Bench consists of 1,368 curated videos paired with 19,562 three-modality question-answer pairs targeting four cognitive abilities: 1) Physical Interaction, 2) Temporal-Causal Relation, 3) Intentional Understanding, and 4) Evaluative Judgment. Together, these dimensions provide a systematic framework for assessing a model's ability to translate visual observations into actionable knowledge, moving beyond surface-level recognition. Our comprehensive evaluation on CFG-Bench reveals that leading MLLMs struggle to produce detailed instructions for physical interactions and exhibit profound limitations in the higher-order reasoning required for intention and evaluation. Moreover, supervised fine-tuning (SFT) on our data demonstrates that teaching an MLLM to articulate fine-grained actions translates directly into significant performance gains on established embodied benchmarks. Our analysis highlights these limitations and offers insights for developing more capable and grounded embodied agents.