Trust-Region Twisted Policy Improvement

📅 2025-04-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper targets a mismatch between how particle filters are conventionally designed and what online planning in deep reinforcement learning (RL) requires, namely a policy improvement at the start of planning. It proposes TRT-SMC, a Trust-Region Twisted Sequential Monte Carlo (SMC) planner. Drawing on ideas from Monte Carlo Tree Search (MCTS), TRT-SMC combines constrained action sampling, explicit terminal-state handling, and improved policy and value target estimation. On both discrete and continuous domains, the authors report better runtime and sample efficiency than baseline MCTS and SMC methods.

📝 Abstract

Monte-Carlo tree search (MCTS) has driven many recent breakthroughs in deep reinforcement learning (RL). However, scaling MCTS to parallel compute has proven challenging in practice, which has motivated alternative planners such as sequential Monte-Carlo (SMC). Many of these SMC methods adopt particle filters for smoothing through a reformulation of RL as a policy inference problem. Yet persistent design choices in these particle filters often conflict with the aim of online planning in RL, which is to obtain a policy improvement at the start of planning. Drawing inspiration from MCTS, we tailor SMC planners specifically for RL by improving data generation within the planner through constrained action sampling and explicit terminal state handling, as well as by improving policy and value target estimation. This leads to our Trust-Region Twisted SMC (TRT-SMC), which shows improved runtime and sample efficiency over baseline MCTS and SMC methods in both discrete and continuous domains.
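
To make the "RL as policy inference" idea more concrete, here is a minimal sketch of a generic particle-filter planner in NumPy. It is not the TRT-SMC algorithm: the function names (`smc_plan`, `toy_step`, `toy_prior`), the batch model interface, the temperature parameter, and the toy environment are all assumptions made for illustration. Particles sample actions from a prior policy, are reweighted by exponentiated reward, and are resampled; the root policy estimate is read off from the first action of each surviving particle.

```python
import numpy as np

def smc_plan(step_fn, prior_fn, state, num_particles=256, depth=5,
             temperature=0.5, seed=0):
    """Generic SMC planner sketch (illustrative, not the paper's method).

    step_fn(states, actions) -> (next_states, rewards): hypothetical batch model.
    prior_fn(states) -> array (N, A) of action probabilities: hypothetical prior.
    depth must be at least 1. Returns an estimated root action distribution.
    """
    rng = np.random.default_rng(seed)
    states = np.repeat(np.asarray(state)[None], num_particles, axis=0)
    first_actions = None
    num_actions = None
    for _ in range(depth):
        probs = prior_fn(states)
        num_actions = probs.shape[1]
        # Sample one action per particle by inverting the cumulative prior.
        u = rng.random((num_particles, 1))
        actions = np.minimum((u > probs.cumsum(axis=1)).sum(axis=1),
                             num_actions - 1)
        if first_actions is None:
            first_actions = actions.copy()  # remember each particle's root action
        states, rewards = step_fn(states, actions)
        # Proposal equals prior here, so the weight is exp(reward / temperature).
        log_w = rewards / temperature
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        # Multinomial resampling; weights are implicitly reset to uniform.
        idx = rng.choice(num_particles, size=num_particles, p=w)
        states, first_actions = states[idx], first_actions[idx]
    # Surviving particles vote for the root action they descended from.
    return np.bincount(first_actions, minlength=num_actions) / num_particles

def toy_step(states, actions):
    """Toy 1-D walk (assumption): action 1 moves right and earns reward 1."""
    return states + 2 * actions - 1, actions.astype(float)

def toy_prior(states):
    """Uniform prior over two actions (assumption)."""
    return np.full((len(states), 2), 0.5)

policy = smc_plan(toy_step, toy_prior, state=0)
print(policy)  # probability mass should shift toward action 1
```

Resampling at every step, as done here, is only one possible schedule. Note also that the root estimate depends on which first actions survive resampling, which is one way to see why the abstract stresses policy improvement at the start of planning.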

Problem

Research questions and friction points this paper is trying to address.

Scaling MCTS for parallel compute in RL

Conflicting particle filter designs for online RL planning

Improving SMC planners for better RL policy improvement

Innovation

Methods, ideas, or system contributions that make the work stand out.

Constrained action sampling for better data generation

Explicit terminal state handling in planning

Improved policy and value target estimation
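
As a rough illustration of what explicit terminal-state handling can involve, the sketch below computes per-particle discounted returns while ignoring any reward that a learned or simulated model emits after a particle has already terminated. This is one common masking scheme, assumed here for illustration; the page does not specify how the paper handles terminal states, and the name `masked_returns` and its array layout are hypothetical.

```python
import numpy as np

def masked_returns(rewards, terminals, discount=0.99):
    """Discounted return per particle, truncated at the first terminal step.

    rewards:   array (N, T) of per-step rewards (hypothetical layout).
    terminals: array (N, T), True where step t ended the episode.
    The reward received on the terminal step itself is still counted.
    """
    rewards = np.asarray(rewards, dtype=float)
    terminals = np.asarray(terminals, dtype=bool).astype(int)
    # Number of terminal flags strictly before step t, for each particle.
    ended_before = np.cumsum(terminals, axis=1) - terminals
    alive = ended_before == 0
    discounts = discount ** np.arange(rewards.shape[1])
    return (rewards * alive * discounts).sum(axis=1)

returns = masked_returns(
    rewards=[[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],
    terminals=[[False, True, False], [False, False, False]],
    discount=0.5,
)
print(returns)  # first particle stops accumulating after its terminal step
```

Without such a mask, particles that have already reached a terminal state would keep collecting model-predicted reward and could distort both the particle weights and the value targets.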
