🤖 AI Summary
This paper addresses three key limitations of actor-critic algorithms in continuous control: insufficient exploration, suboptimal policy convergence, and low sample efficiency. To this end, the authors propose Monte Carlo Beam Search (MCBS), an online action re-ranking mechanism that integrates short-horizon Monte Carlo rollouts with structured beam search. MCBS is, to the authors' knowledge, the first adaptation of beam search to continuous action spaces, enabling candidate-action evaluation and adaptive re-ranking under deterministic policies within the TD3 framework. The paper further introduces a joint adaptive scheduling strategy for beam width and rollout depth. Empirical evaluation on benchmark tasks, including HalfCheetah-v4, shows that MCBS improves sample efficiency by approximately 2x over TD3, SAC, PPO, and A2C: it reaches 90% of the optimal reward within only 200k environment steps, significantly accelerating convergence.
📝 Abstract
Actor-critic methods such as Twin Delayed Deep Deterministic Policy Gradient (TD3) rely on simple noise-based exploration, which can lead to suboptimal policy convergence. In this study, we introduce Monte Carlo Beam Search (MCBS), a new hybrid method that combines beam search and Monte Carlo rollouts with TD3 to improve exploration and action selection. MCBS generates several candidate actions around the policy's output and evaluates them through short-horizon rollouts, enabling the agent to make better-informed choices. We evaluate MCBS across various continuous-control benchmarks, including HalfCheetah-v4, Walker2d-v5, and Swimmer-v5, showing improved sample efficiency and performance compared to standard TD3 and other baselines such as SAC, PPO, and A2C. Our findings highlight MCBS's ability to enhance policy learning through structured look-ahead search while remaining computationally efficient. Additionally, we provide a detailed analysis of key hyperparameters, including beam width and rollout depth, and explore adaptive strategies to tune MCBS for complex control tasks. Our method converges faster than TD3, SAC, PPO, and A2C across different environments: for instance, it reaches 90% of the maximum achievable reward within roughly 200k timesteps, versus about 400k timesteps for the second-best method.
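The candidate-generation and rollout-ranking step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `policy`, `q_value`, and `env_step` are hypothetical stand-ins for the TD3 actor, critic, and a rollout simulator, and the structured beam expansion with adaptive width/depth scheduling is simplified here to a single round of perturb-rollout-rank.

```python
import numpy as np

def mcbs_select_action(policy, q_value, env_step, state,
                       beam_width=4, rollout_depth=3,
                       noise_scale=0.1, gamma=0.99, rng=None):
    """Sketch of MCBS-style action re-ranking (hypothetical interfaces).

    policy(state) -> action            : deterministic actor
    q_value(state, action) -> float    : critic estimate for bootstrapping
    env_step(state, action) -> (s, r)  : one-step rollout model
    """
    rng = np.random.default_rng() if rng is None else rng
    base = policy(state)
    # Candidates: the policy's own output plus Gaussian perturbations around it.
    candidates = [base] + [base + noise_scale * rng.standard_normal(base.shape)
                           for _ in range(beam_width - 1)]
    best_action, best_return = base, -np.inf
    for a in candidates:
        s, ret, disc, act = state, 0.0, 1.0, a
        # Short-horizon Monte Carlo rollout: take the candidate first,
        # then follow the current policy.
        for _ in range(rollout_depth):
            s, r = env_step(s, act)
            ret += disc * r
            disc *= gamma
            act = policy(s)
        ret += disc * q_value(s, act)  # bootstrap the tail with the critic
        if ret > best_return:
            best_action, best_return = a, ret
    return best_action
```

In a full beam search the top-k candidates would be expanded again at each depth; the single-round ranking above only conveys how short rollouts let the agent re-rank actions near the policy output.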