🤖 AI Summary
To address the low sample efficiency and high exploration cost of reinforcement learning (RL) in classical Markov decision process (MDP) environments, this paper proposes LORO: a framework that uses large language models (LLMs) as policy priors to generate high-quality offline datasets covering the state-action pairs visited by near-optimal policies, thereby warm-starting RL algorithms. LORO combines the LLM-derived prior with standard RL exploration, retaining convergence to an optimal policy while substantially improving sample efficiency. Evaluated on benchmark OpenAI Gym tasks, including CartPole and Pendulum, LORO achieves up to 4× the cumulative reward of online RL baselines, reduces the number of interaction samples required, and accelerates convergence. Its core contribution is a principled, LLM-driven mechanism for collecting warm-start data, offering a practical recipe for sample-efficient RL.
📝 Abstract
We investigate the use of Large Language Models (LLMs) to collect high-quality data for warm-starting Reinforcement Learning (RL) algorithms in classical Markov Decision Process (MDP) environments. In this work, we focus on using an LLM to generate an off-policy dataset that sufficiently covers the state-actions visited by optimal policies, and then using an RL algorithm to explore the environment and improve upon the policy suggested by the LLM. Our algorithm, LORO, both converges to an optimal policy and achieves high sample efficiency thanks to the LLM's good starting policy. On multiple OpenAI Gym environments, such as CartPole and Pendulum, we empirically demonstrate that LORO outperforms baseline algorithms such as pure LLM-based policies, pure RL, and a naive combination of the two, achieving up to $4\times$ the cumulative rewards of the pure RL baseline.
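The two-phase pattern the abstract describes — a strong prior policy fills an offline buffer, then a standard RL algorithm refines it online — can be illustrated with a minimal sketch. This is not the paper's implementation: the toy chain MDP, the scripted `prior_policy` (standing in for an LLM-suggested policy), and all names and hyperparameters below are illustrative assumptions.

```python
import random

# Toy 1-D chain MDP: states 0..4, reaching state 4 yields reward 1.
N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)  # move left / right

def step(s, a):
    s2 = max(0, min(N_STATES - 1, s + a))
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def prior_policy(s):
    # Stand-in for the LLM prior: near-optimal, mostly moves right.
    return +1 if random.random() < 0.9 else -1

def collect_offline(n_episodes):
    # Phase 1: the prior policy generates an off-policy warm-start dataset.
    buf = []
    for _ in range(n_episodes):
        s = 0
        for _ in range(20):
            a = prior_policy(s)
            s2, r, done = step(s, a)
            buf.append((s, a, r, s2, done))
            s = s2
            if done:
                break
    return buf

def q_learning(buf, online_episodes, alpha=0.5, gamma=0.95, eps=0.1):
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    # Warm start: replay the prior-collected offline transitions.
    for (s, a, r, s2, done) in buf * 20:
        target = r + (0.0 if done else gamma * max(Q[s2, b] for b in ACTIONS))
        Q[s, a] += alpha * (target - Q[s, a])
    # Phase 2: online epsilon-greedy exploration improves on the prior.
    for _ in range(online_episodes):
        s = 0
        for _ in range(20):
            if random.random() < eps:
                a = random.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda b: Q[s, b])
            s2, r, done = step(s, a)
            target = r + (0.0 if done else gamma * max(Q[s2, b] for b in ACTIONS))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
            if done:
                break
    return Q

random.seed(0)
Q = q_learning(collect_offline(10), online_episodes=50)
greedy = [max(ACTIONS, key=lambda a: Q[s, a]) for s in range(N_STATES - 1)]
print(greedy)  # greedy policy moves right toward the goal
```

The design point is the warm start: because the offline buffer already concentrates on state-actions a near-optimal policy visits, the online phase spends far fewer interactions discovering where reward lies than Q-learning from scratch would.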