🤖 AI Summary
Existing embodied planning benchmarks primarily target short-horizon tasks and coarse-grained actions, limiting their applicability to long-horizon, fine-grained manipulation in complex physical environments. Method: We introduce CookBench, the first high-fidelity, Unity-based benchmark for complex cooking tasks, featuring a two-stage paradigm of intention recognition and physics-aware interaction. It formalizes long-horizon, multi-step, high-precision culinary challenges and proposes a novel fine-grained, spatial-level action abstraction. A unified API toolkit supports both macro-level decision-making and micro-level physical interaction, and integrates with LLMs and VLMs for end-to-end intention parsing and hierarchical action planning. Contribution/Results: Experiments expose critical limitations of current proprietary large models in long-horizon embodied reasoning. CookBench and its codebase are fully open-sourced, advancing embodied AI research in realistic, physics-grounded scenarios.
📝 Abstract
Embodied planning aims to create agents capable of executing long-horizon tasks in complex physical worlds. However, existing embodied planning benchmarks frequently feature short-horizon tasks and coarse-grained action primitives. To address this gap, we introduce CookBench, a benchmark for long-horizon planning in complex cooking scenarios. Leveraging a high-fidelity simulation environment built on the Unity game engine, we define frontier AI challenges in a complex, realistic setting. The core task in CookBench is a two-stage process. First, in Intention Recognition, an agent must accurately parse a user's complex intent. Second, in Embodied Interaction, the agent executes the identified cooking goal through a long-horizon, fine-grained sequence of physical actions. Unlike existing embodied planning benchmarks, we refine action granularity to a spatial level that retains crucial operational information while abstracting away low-level robotic control. In addition, we provide a comprehensive toolset that encapsulates the simulator. Its unified API supports both macro-level operations, such as placing orders and purchasing ingredients, and a rich set of fine-grained embodied actions for physical interaction, enabling researchers to focus on high-level planning and decision-making. Furthermore, we present an in-depth analysis of state-of-the-art closed-source Large Language Models and Vision-Language Models, revealing their major shortcomings and the challenges posed by complex, long-horizon tasks. The full benchmark will be open-sourced to facilitate future research.
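To make the two-stage paradigm concrete, the sketch below mocks the flow in plain Python: a stage-1 intention parser (an LLM call in the benchmark, a trivial lookup here) followed by stage-2 execution mixing macro-level operations with fine-grained embodied actions. All class and method names (`MockCookingEnv`, `purchase`, `pick`, `place`) are hypothetical illustrations, not CookBench's actual API.

```python
class MockCookingEnv:
    """Toy stand-in for the simulator's unified API.
    Hypothetical names -- the real CookBench toolkit may differ."""

    def __init__(self):
        self.inventory = []
        self.log = []

    # Macro-level operation (e.g. purchasing an ingredient)
    def purchase(self, ingredient):
        self.inventory.append(ingredient)
        self.log.append(f"purchase({ingredient})")

    # Fine-grained, spatial-level embodied actions
    def pick(self, obj):
        self.log.append(f"pick({obj})")

    def place(self, obj, target):
        self.log.append(f"place({obj}, {target})")


def recognize_intention(user_request):
    """Stage 1: parse the user's goal.
    The benchmark uses an LLM here; a keyword check suffices for the sketch."""
    return "fried_egg" if "egg" in user_request else "unknown"


def plan_and_execute(goal, env):
    """Stage 2: a long-horizon plan expressed as macro + micro actions."""
    if goal == "fried_egg":
        env.purchase("egg")      # macro: acquire the ingredient
        env.pick("egg")          # micro: physical interaction
        env.place("egg", "pan")
    return env.log


env = MockCookingEnv()
goal = recognize_intention("Could you make me an egg for breakfast?")
print(plan_and_execute(goal, env))
```

The point of the split is that the planner reasons over spatial-level actions like `pick` and `place` while the simulator handles low-level robotic control internally.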