Efficient Off-Policy Learning for High-Dimensional Action Spaces

📅 2024-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
In off-policy deep reinforcement learning (RL) with high-dimensional action spaces, explicit Q-function estimation suffers from the curse of dimensionality and poor sample efficiency. Method: This paper proposes the first fully Q-function-free off-policy deep RL framework, relying solely on a state-value function (V). It trains a deep V-network with a weighted importance sampling loss, employs twin V-networks to decouple estimation and mitigate optimization bias, incorporates robust policy improvement with clipped importance weights, and provides a theoretical variance analysis relative to V-trace. Contributions/Results: Across multiple benchmark tasks, the method achieves significantly improved sample efficiency and final performance, yielding more stable policies with higher returns. Training is also simpler and more robust than with Q-based approaches, demonstrating strong empirical and theoretical advantages in high-dimensional continuous control.
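The core mechanics described above (importance ratios clipped to bound variance, then self-normalized as in weighted importance sampling, driving a value-regression loss) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation; all function and parameter names (`rho_max`, `clipped_is_value_targets`, etc.) are assumptions for illustration.

```python
import numpy as np

def clipped_is_value_targets(rewards, values_next, log_pi, log_mu,
                             gamma=0.99, rho_max=1.0):
    """One-step off-policy value targets with clipped importance weights.

    rho = min(pi/mu, rho_max) truncates each importance ratio, trading a
    small bias for much lower variance (the same idea V-trace relies on).
    Names and the one-step form are illustrative, not from the paper.
    """
    rho = np.minimum(np.exp(log_pi - log_mu), rho_max)
    targets = rewards + gamma * values_next
    # Weighted importance sampling: normalize by the sum of the weights,
    # giving a self-normalizing (and therefore bounded) estimator.
    weights = rho / rho.sum()
    return targets, weights

def weighted_is_value_loss(values_pred, targets, weights):
    """Weight-averaged squared error between V predictions and targets."""
    return np.sum(weights * (targets - values_pred) ** 2)
```

In a deep-RL setting the squared error would be backpropagated into the V-network parameters; here plain arrays stand in for network outputs to keep the sketch self-contained.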

📝 Abstract
Existing off-policy reinforcement learning algorithms often rely on an explicit state-action-value function representation, which can be problematic in high-dimensional action spaces due to the curse of dimensionality. This reliance results in data inefficiency as maintaining a state-action-value function in such spaces is challenging. We present an efficient approach that utilizes only a state-value function as the critic for off-policy deep reinforcement learning. This approach, which we refer to as Vlearn, effectively circumvents the limitations of existing methods by eliminating the necessity for an explicit state-action-value function. To this end, we leverage a weighted importance sampling loss for learning deep value functions from off-policy data. While this is common for linear methods, it has not been combined with deep value function networks. This transfer to deep methods is not straightforward and requires novel design choices such as robust policy updates, twin value function networks to avoid an optimization bias, and importance weight clipping. We also present a novel analysis of the variance of our estimate compared to commonly used importance sampling estimators such as V-trace. Our approach improves sample complexity as well as final performance and ensures consistent and robust performance across various benchmark tasks. Eliminating the state-action-value function in Vlearn facilitates a streamlined learning process, yielding high-return agents.
Problem

Research questions and friction points this paper is trying to address.

Addresses inefficiency in high-dimensional action spaces
Eliminates need for state-action-value function
Improves sample complexity and performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes state-value function only
Employs weighted importance sampling loss
Introduces twin value function networks
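The twin value-network idea above can be sketched in a few lines: two independently initialized V estimators whose pointwise minimum serves as a pessimistic value estimate, countering the optimization bias a single network accumulates (analogous to twin Q-networks in TD3/SAC). This is an illustrative sketch with linear stand-ins for the networks; the function names and shapes are assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independently initialized "value heads" (linear stand-ins for
# deep V-networks), each mapping a state vector to a scalar value.
w1 = rng.normal(size=(4,))
w2 = rng.normal(size=(4,))

def v1(states):
    return states @ w1

def v2(states):
    return states @ w2

def twin_value(states):
    """Pessimistic value estimate: elementwise minimum of the twin V heads."""
    return np.minimum(v1(states), v2(states))
```

During training, each head would be regressed toward the off-policy targets while the minimum is used wherever a value estimate feeds the policy update.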