Improving the Data-efficiency of Reinforcement Learning by Warm-starting with LLM

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low data efficiency and high exploration cost of reinforcement learning (RL) in classical Markov decision processes (MDPs), this paper proposes LORO, a framework that uses large language models (LLMs) as policy priors to generate high-quality offline datasets covering the state-actions visited by near-optimal policies, thereby warm-starting standard RL algorithms. LORO then lets the RL algorithm explore the environment and refine the LLM-suggested policy, retaining convergence to an optimal policy while substantially improving sample efficiency. On OpenAI Gym benchmarks, including CartPole and Pendulum, LORO achieves up to 4× the cumulative reward of a pure online RL baseline and outperforms pure LLM-based policies as well as a naive combination of LLM and RL. Its core contribution is a principled, LLM-driven mechanism for collecting warm-start data, pointing toward a practical recipe for sample-efficient RL.

📝 Abstract
We investigate the usage of Large Language Model (LLM) in collecting high-quality data to warm-start Reinforcement Learning (RL) algorithms for learning in some classical Markov Decision Process (MDP) environments. In this work, we focus on using LLM to generate an off-policy dataset that sufficiently covers state-actions visited by optimal policies, then later using an RL algorithm to explore the environment and improve the policy suggested by the LLM. Our algorithm, LORO, can both converge to an optimal policy and have a high sample efficiency thanks to the LLM's good starting policy. On multiple OpenAI Gym environments, such as CartPole and Pendulum, we empirically demonstrate that LORO outperforms baseline algorithms such as pure LLM-based policies, pure RL, and a naive combination of the two, achieving up to $4\times$ the cumulative rewards of the pure RL baseline.
Problem

Research questions and friction points this paper is trying to address.

Using LLMs to improve the data efficiency of reinforcement learning
Generating high-quality off-policy datasets that cover the state-actions of optimal policies
Enhancing RL sample efficiency with LLM warm-starting
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM generates high-quality off-policy dataset
Warm-start RL with LLM-suggested initial policy
LORO combines LLM and RL for efficiency
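The warm-start idea described above can be illustrated in a toy setting: a fixed "prior policy" stands in for the LLM, its rollouts fill an offline dataset, and tabular Q-learning is first trained on that dataset before refining the policy online. This is a minimal sketch of the general warm-start pattern, not the paper's implementation; the chain MDP, the `prior_policy` function, and all hyperparameters are illustrative assumptions.

```python
import random

random.seed(0)  # reproducibility for the online exploration phase

# Toy chain MDP: states 0..N-1, actions 0 (left) / 1 (right), reward 1 at the goal.
N = 6
GOAL = N - 1

def step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), GOAL)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def prior_policy(s):
    # Stand-in for the LLM prior: always suggests moving right (near-optimal here).
    return 1

def collect_offline(policy, episodes=20, max_steps=20):
    # Roll out the prior policy to build the off-policy warm-start dataset.
    data = []
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            a = policy(s)
            s2, r, done = step(s, a)
            data.append((s, a, r, s2, done))
            s = s2
            if done:
                break
    return data

def q_update(Q, s, a, r, s2, done, alpha=0.5, gamma=0.9):
    target = r + (0.0 if done else gamma * max(Q[s2]))
    Q[s][a] += alpha * (target - Q[s][a])

def warm_start_q_learning(offline_data, online_episodes=30, eps=0.1):
    Q = [[0.0, 0.0] for _ in range(N)]
    # Phase 1: offline Q-updates on the prior-generated dataset (the warm start).
    for _ in range(5):
        for (s, a, r, s2, done) in offline_data:
            q_update(Q, s, a, r, s2, done)
    # Phase 2: standard online epsilon-greedy Q-learning to refine the policy.
    for _ in range(online_episodes):
        s = 0
        for _ in range(20):
            if random.random() < eps:
                a = random.choice([0, 1])
            else:
                a = max((0, 1), key=lambda x: Q[s][x])
            s2, r, done = step(s, a)
            q_update(Q, s, a, r, s2, done)
            s = s2
            if done:
                break
    return Q

Q = warm_start_q_learning(collect_offline(prior_policy))
greedy = [max((0, 1), key=lambda a: Q[s][a]) for s in range(N)]
print(greedy)
```

Because the offline phase already propagates value along the prior's trajectories, the online phase starts from an informed Q-table rather than from scratch, which is the sample-efficiency mechanism the paper exploits with an LLM in place of the hand-coded prior.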