LLM-BABYBENCH: Understanding and Evaluating Grounded Planning and Reasoning in LLMs

📅 2025-05-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of evaluating large language models’ (LLMs) planning and reasoning capabilities in embodied interactive environments. To this end, we introduce LLM-BabyBench—the first text-based benchmark for embodied planning and reasoning, built upon the BabyAI grid-world platform. We propose a three-dimensional evaluation framework covering environment state prediction, action sequence planning, and high-level instruction decomposition. Our method features a novel expert-agent-driven data generation paradigm, integrating structured distillation, expert trajectory extraction, and interactive plan verification to enable end-to-end reproducible assessment. Experiments reveal significant performance bottlenecks of current LLMs on embodied reasoning tasks. All benchmark specifications, datasets, and generation/evaluation code are publicly released, establishing a measurable, scalable evaluation infrastructure for embodied intelligence.

📝 Abstract
Assessing the capacity of Large Language Models (LLMs) to plan and reason within the constraints of interactive environments is crucial for developing capable AI agents. We introduce $\textbf{LLM-BabyBench}$, a new benchmark suite designed specifically for this purpose. Built upon a textual adaptation of the procedurally generated BabyAI grid world, this suite evaluates LLMs on three fundamental aspects of grounded intelligence: (1) predicting the consequences of actions on the environment state ($\textbf{Predict}$ task), (2) generating sequences of low-level actions to achieve specified objectives ($\textbf{Plan}$ task), and (3) decomposing high-level instructions into coherent subgoal sequences ($\textbf{Decompose}$ task). We detail the methodology for generating the three corresponding datasets ($\texttt{LLM-BabyBench-Predict}$, $\texttt{-Plan}$, $\texttt{-Decompose}$) by extracting structured information from an expert agent operating within the text-based environment. Furthermore, we provide a standardized evaluation harness and metrics, including environment interaction for validating generated plans, to facilitate reproducible assessment of diverse LLMs. Initial baseline results highlight the challenges posed by these grounded reasoning tasks. The benchmark suite, datasets, data generation code, and evaluation code are made publicly available ($\href{https://github.com/choukrani/llm-babybench}{\text{GitHub}}$, $\href{https://huggingface.co/datasets/salem-mbzuai/LLM-BabyBench}{\text{HuggingFace}}$).
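The abstract notes that the evaluation harness validates generated plans through environment interaction rather than string matching. A minimal sketch of that idea, in a toy grid that stands in for the text-based BabyAI world (the `TinyGrid` class, `validate_plan` helper, and left/right/forward action set are illustrative assumptions, not the benchmark's actual API):

```python
# Hypothetical sketch: replay a model-generated action sequence in a
# minimal grid world and check whether the goal cell is reached.

from dataclasses import dataclass

DIRS = [(1, 0), (0, 1), (-1, 0), (0, -1)]  # east, south, west, north


@dataclass
class TinyGrid:
    width: int
    height: int
    agent: tuple   # (x, y) agent position
    heading: int   # index into DIRS
    goal: tuple    # (x, y) goal position

    def step(self, action: str) -> None:
        if action == "left":
            self.heading = (self.heading - 1) % 4
        elif action == "right":
            self.heading = (self.heading + 1) % 4
        elif action == "forward":
            dx, dy = DIRS[self.heading]
            nx, ny = self.agent[0] + dx, self.agent[1] + dy
            if 0 <= nx < self.width and 0 <= ny < self.height:
                self.agent = (nx, ny)


def validate_plan(env: TinyGrid, actions: list) -> bool:
    """Replay a generated action sequence; succeed iff the goal is reached."""
    for a in actions:
        env.step(a)
    return env.agent == env.goal


env = TinyGrid(width=4, height=4, agent=(0, 0), heading=0, goal=(2, 1))
ok = validate_plan(env, ["forward", "forward", "right", "forward"])  # True
```

Executing the plan against the environment catches plans that look plausible but fail under the world's dynamics, which exact-string comparison against a single reference trajectory cannot.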
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' planning in interactive environments
Evaluating LLMs' grounded reasoning with BabyAI tasks
Measuring action prediction, planning, and instruction decomposition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Textual adaptation of BabyAI grid world
Evaluates Predict, Plan, Decompose tasks
Standardized evaluation harness and metrics
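For the state-prediction side of the harness, a natural metric is exact-match accuracy between predicted and ground-truth environment states. A minimal sketch (the function name and string-encoded state format are assumptions for illustration, not the benchmark's actual metric implementation):

```python
# Hypothetical sketch of an exact-match metric for the Predict task:
# score the fraction of predicted states identical to the reference.

def exact_match_accuracy(predictions: list, references: list) -> float:
    """Fraction of predictions that exactly match their reference state."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must be the same length")
    if not predictions:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(predictions)


acc = exact_match_accuracy(
    ["agent at (2,1) facing south", "door open"],
    ["agent at (2,1) facing south", "door closed"],
)  # 0.5
```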