How well can LLMs provide planning feedback in grounded environments?

📅 2025-09-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work systematically evaluates how well large language models (LLMs) and vision-language models (VLMs) provide planning feedback—binary judgments, preference rankings, action advising, goal advising, and delta action feedback—across three embodied planning settings: symbolic, language, and continuous control. Method: Pretrained foundation models are grounded in each environment, and the study measures how inference techniques such as in-context learning, chain-of-thought reasoning, and access to environment dynamics affect feedback quality. Contribution/Results: Experiments show that larger models and reasoning models consistently provide more accurate, less biased feedback and benefit more from enhanced inference methods, reducing reliance on handcrafted reward functions and expert demonstrations. However, feedback quality degrades in environments with complex dynamics or continuous state and action spaces. The result is a cross-domain, cross-control-paradigm quantitative assessment of foundation models' planning feedback capabilities—a step toward autonomous, trustworthy embodied agents.

📝 Abstract
Learning to plan in grounded environments typically requires carefully designed reward functions or high-quality annotated demonstrations. Recent works show that pretrained foundation models, such as large language models (LLMs) and vision-language models (VLMs), capture background knowledge helpful for planning, which reduces the amount of reward design and demonstration needed for policy learning. We evaluate how well LLMs and VLMs provide feedback across symbolic, language, and continuous control environments. We consider prominent types of feedback for planning, including binary feedback, preference feedback, action advising, goal advising, and delta action feedback. We also consider inference methods that impact feedback performance, including in-context learning, chain-of-thought, and access to environment dynamics. We find that foundation models can provide diverse, high-quality feedback across domains. Moreover, larger models and reasoning models consistently provide more accurate feedback, exhibit less bias, and benefit more from enhanced inference methods. Finally, feedback quality degrades in environments with complex dynamics or continuous state and action spaces.
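To make the feedback types concrete, here is a minimal sketch of the binary-feedback setting described in the abstract: the model is shown a state and a candidate action and asked for a yes/no judgment. The prompt format, the `llm` callable, and the answer parsing are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of binary planning feedback from a language model.
# The prompt wording, llm interface, and stub model below are
# hypothetical assumptions for illustration only.

def binary_feedback(llm, state: str, action: str) -> bool:
    """Ask the model whether `action` makes progress from `state`."""
    prompt = (
        f"Environment state: {state}\n"
        f"Proposed action: {action}\n"
        "Does this action make progress toward the goal? Answer Yes or No."
    )
    reply = llm(prompt).strip().lower()
    return reply.startswith("yes")

# Stub standing in for a real LLM so the sketch runs end to end.
def stub_llm(prompt: str) -> str:
    return "Yes" if "pick up the key" in prompt else "No"

print(binary_feedback(stub_llm, "agent at door, key on floor", "pick up the key"))   # True
print(binary_feedback(stub_llm, "agent at door, key on floor", "open the door"))     # False
```

Preference feedback, action advising, and the other types the paper evaluates would follow the same pattern with different prompts and output parsers.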
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM feedback quality in grounded planning environments
Assessing diverse feedback types across symbolic and continuous domains
Analyzing how model size and inference methods impact feedback accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using LLMs and VLMs for diverse planning feedback
Evaluating feedback types like binary and action advising
Employing inference methods such as chain-of-thought
Yuxuan Li
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
Victor Zhong
Assistant Professor at Cheriton School of Computer Science, University of Waterloo
artificial intelligence · machine learning · natural language processing