When are LLMs Sufficient Policy Optimizers for Sequential RL Tasks?

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates under what conditions large language models (LLMs) can serve as general-purpose black-box policy optimizers in place of conventional reinforcement learning algorithms. To this end, the authors propose Prompted Policy Optimization (PromptPO), a method that iteratively generates executable policies by prompting an LLM with Python-based descriptions of the state space, action space, and reward function, and refines them using environmental feedback. The first systematic evaluation demonstrates that PromptPO automatically synthesizes diverse policies—ranging from rule-based controllers to planning algorithms—and matches or exceeds standard RL baselines with fewer environment interactions on challenging exploration tasks, Meta-World benchmarks, and several real-world control problems. However, it remains limited in domains requiring fine-grained continuous control, such as MuJoCo tasks.
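The generate-evaluate-refine loop described above can be sketched in a few lines. This is only an illustrative sketch under loud assumptions, not the authors' implementation: the toy 1-D environment, `ENV_DESCRIPTION`, `stub_llm`, `compile_policy`, and `prompted_policy_search` are all hypothetical names, and `stub_llm` returns canned policy source code in place of a real LLM call.

```python
from typing import Callable, List, Tuple

# Hypothetical Python-style task description handed to the LLM (toy task, not from the paper).
ENV_DESCRIPTION = """
state: float x in [-10, 10]
action: int in {-1, 0, 1}; next x = clip(x + action, -10, 10)
reward: -abs(x) after each step
"""


def rollout(policy: Callable[[float], int], x0: float = 5.0, horizon: int = 20) -> float:
    """Run one episode in the toy environment and return the total reward."""
    x, total = x0, 0.0
    for _ in range(horizon):
        x = max(-10.0, min(10.0, x + policy(x)))
        total += -abs(x)
    return total


def stub_llm(prompt: str, iteration: int) -> str:
    """Stand-in for an LLM call (assumption): returns canned policy source code."""
    candidates = [
        "def policy(x):\n    return 1\n",
        "def policy(x):\n    return 0\n",
        "def policy(x):\n    return -1 if x > 0 else (1 if x < 0 else 0)\n",
    ]
    return candidates[iteration % len(candidates)]


def compile_policy(source: str) -> Callable[[float], int]:
    """Turn generated source into a callable. Only exec code you trust or sandbox."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace["policy"]


def prompted_policy_search(n_iters: int = 3) -> Tuple[str, float, List[float]]:
    """Iteratively request a policy, score it by rollout, and feed the score back."""
    best_src, best_ret = "", float("-inf")
    history: List[float] = []
    feedback = "no rollouts yet"
    for i in range(n_iters):
        prompt = f"{ENV_DESCRIPTION}\nPrevious feedback: {feedback}\nWrite policy(x)."
        src = stub_llm(prompt, i)
        ret = rollout(compile_policy(src))
        history.append(ret)
        if ret > best_ret:
            best_src, best_ret = src, ret
        feedback = f"iteration {i}: return {ret:.1f}; best so far {best_ret:.1f}"
    return best_src, best_ret, history


if __name__ == "__main__":
    src, ret, hist = prompted_policy_search()
    print(hist, ret)
    print(src)
```

In a real setting the stub would be replaced by a call to an LLM API, and the generated code would need to be sandboxed before execution; both details are outside what the summary specifies.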

📝 Abstract

We study when large language models (LLMs) can serve as effective black-box policy optimizers for reinforcement learning (RL) tasks, i.e., when can we replace classical RL algorithms with an LLM? We explore this question by introducing Prompted Policy Optimization (PromptPO), an iterative method that prompts an LLM with Python descriptions of the state space, action space, and reward function, then has it generate and refine executable policies based on rollout feedback. Across hard exploration environments, Meta-World robotics tasks, and several real-world control problems, PromptPO often matches or exceeds the performance of standard RL baselines while using substantially fewer environment interactions. Without any explicit prompting beyond the instruction to maximize expected return, PromptPO outputs policies that range from tuned proportional controllers and rule-based plans to policies that run planning algorithms such as value iteration. Our results indicate that LLM-based policy optimization is sufficient when the LLM can leverage prior knowledge about the environment or the optimization strategy. In MuJoCo domains, however, PromptPO underperforms standard RL baselines, which suggests that LLM-based policy optimization may be limited in settings that require fine-grained continuous control.
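To make concrete what "a policy that runs a planning algorithm like value iteration" could look like as executable code, here is a minimal sketch on a five-state chain. The chain MDP, the constants, and the function names are illustrative assumptions and are not taken from the paper or its benchmarks.

```python
from typing import List, Tuple

N_STATES = 5        # states 0..4 on a chain; state 4 is a terminal goal (assumed toy MDP)
ACTIONS = (-1, 1)   # move left or right
GAMMA = 0.9         # discount factor


def step(state: int, action: int) -> Tuple[int, float]:
    """Deterministic model: move along the chain, reward 1.0 on entering the goal."""
    goal = N_STATES - 1
    nxt = max(0, min(goal, state + action))
    reward = 1.0 if (nxt == goal and state != goal) else 0.0
    return nxt, reward


def value_iteration(tol: float = 1e-8) -> List[float]:
    """Sweep Bellman optimality backups until the largest update falls below tol."""
    values = [0.0] * N_STATES
    while True:
        delta = 0.0
        for s in range(N_STATES - 1):  # the goal is terminal, so its value stays 0
            best = max(r + GAMMA * values[n] for n, r in (step(s, a) for a in ACTIONS))
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values


def policy(state: int) -> int:
    """Act greedily with respect to the values computed by value iteration."""
    values = value_iteration()

    def q(action: int) -> float:
        nxt, reward = step(state, action)
        return reward + GAMMA * values[nxt]

    return max(ACTIONS, key=q)


if __name__ == "__main__":
    print(value_iteration())
    print([policy(s) for s in range(N_STATES - 1)])
```

The point of the sketch is that such a policy needs a transition and reward model to plan with, which is plausibly why Python descriptions of the environment are useful input; that reading is an interpretation, not a claim made in the abstract.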

Problem

Research questions and friction points this paper is trying to address.

large language models

policy optimization

reinforcement learning

sequential decision-making

black-box optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Prompted Policy Optimization

Large Language Models

Reinforcement Learning