CookBench: A Long-Horizon Embodied Planning Benchmark for Complex Cooking Scenarios

📅 2025-08-05
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing embodied planning benchmarks primarily focus on short-horizon tasks and coarse-grained actions, limiting their applicability to long-horizon, fine-grained manipulation in complex physical environments. Method: We introduce CookBench, the first high-fidelity, Unity-based benchmark for complex cooking tasks, featuring a two-stage paradigm encompassing intention recognition and physics-aware interaction. It formalizes long-horizon, multi-step, high-precision culinary challenges and proposes a novel fine-grained spatial-level action abstraction. A unified API toolkit supports both macro-level decision-making and micro-level physical interaction, integrated with LLMs and VLMs for end-to-end intention parsing and hierarchical action planning. Contribution/Results: Experiments expose critical limitations of current proprietary large models in long-horizon embodied reasoning. CookBench and its codebase are fully open-sourced, advancing embodied AI research in realistic, physics-grounded scenarios.


๐Ÿ“ Abstract
Embodied Planning is dedicated to the goal of creating agents capable of executing long-horizon tasks in complex physical worlds. However, existing embodied planning benchmarks frequently feature short-horizon tasks and coarse-grained action primitives. To address this challenge, we introduce CookBench, a benchmark for long-horizon planning in complex cooking scenarios. By leveraging a high-fidelity simulation environment built upon the powerful Unity game engine, we define frontier AI challenges in a complex, realistic environment. The core task in CookBench is designed as a two-stage process. First, in Intention Recognition, an agent must accurately parse a user's complex intent. Second, in Embodied Interaction, the agent must execute the identified cooking goal through a long-horizon, fine-grained sequence of physical actions. Unlike existing embodied planning benchmarks, we refine the action granularity to a spatial level that captures crucial operational information while abstracting away low-level robotic control. In addition, we provide a comprehensive toolset that encapsulates the simulator. Its unified API supports both macro-level operations, such as placing orders and purchasing ingredients, and a rich set of fine-grained embodied actions for physical interaction, enabling researchers to focus on high-level planning and decision-making. Furthermore, we present an in-depth analysis of state-of-the-art, closed-source Large Language Models and Vision-Language Models, revealing their major shortcomings and the challenges posed by complex, long-horizon tasks. The full benchmark will be open-sourced to facilitate future research.
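To make the two-stage workflow concrete, here is a minimal sketch of how an agent might drive a simulator wrapper that exposes both macro-level operations and fine-grained embodied actions. All names here (`CookEnv`, `place_order`, `purchase`, `pick`, `place`, `run_plan`) are illustrative assumptions, not the actual CookBench API, which the abstract does not specify.

```python
# Hypothetical sketch of a CookBench-style two-stage workflow.
# Every class and method name below is an assumption for illustration;
# the real toolkit's API may differ.

class CookEnv:
    """Minimal stand-in for a simulator wrapper with a unified API."""

    def __init__(self):
        self.log = []  # record of executed actions, for inspection

    # Macro-level operations (high-level decision-making)
    def place_order(self, dish):
        self.log.append(("order", dish))

    def purchase(self, ingredient):
        self.log.append(("buy", ingredient))

    # Fine-grained, spatial-level embodied actions
    def pick(self, obj):
        self.log.append(("pick", obj))

    def place(self, obj, target):
        self.log.append(("place", obj, target))


def run_plan(env, plan):
    """Stage 2 (Embodied Interaction): execute a long-horizon
    action sequence step by step against the environment."""
    for action, *args in plan:
        getattr(env, action)(*args)
    return env.log


# Stage 1 (Intention Recognition) would map a user utterance such as
# "make me a tomato omelette" to a goal dish; a planner (e.g. an LLM)
# then emits a long-horizon plan like the one below.
env = CookEnv()
plan = [
    ("place_order", "tomato omelette"),
    ("purchase", "tomato"),
    ("purchase", "egg"),
    ("pick", "tomato"),
    ("place", "tomato", "cutting_board"),
]
run_plan(env, plan)
```

The point of the sketch is the separation of concerns the abstract describes: the agent reasons over named, spatial-level actions (pick, place on a target surface) rather than low-level robotic control, while the same unified interface also carries macro-level operations like ordering and purchasing.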
Problem

Research questions and friction points this paper is trying to address.

Develops benchmark for long-horizon embodied planning in cooking
Addresses coarse action primitives in existing planning benchmarks
Evaluates AI models on complex intention recognition and execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

High-fidelity Unity simulation for complex tasks
Two-stage process: intent parsing and execution
Fine-grained actions with unified API support
Authors
Muzhen Cai, Harbin Institute of Technology
Xiubo Chen, Harbin Institute of Technology
Yining An, Harbin Institute of Technology
Jiaxin Zhang, Harbin Institute of Technology
Xuesong Wang, Harbin Institute of Technology
Wang Xu, Harbin Institute of Technology (natural language processing, artificial intelligence)
Weinan Zhang, Harbin Institute of Technology
Ting Liu, Harbin Institute of Technology