Learning in complex action spaces without policy gradients

📅 2024-10-08
🏛️ Trans. Mach. Learn. Res.
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the root causes of the performance gap between policy gradient (PG) and action-value methods as action-space complexity increases, challenging the perceived necessity of PG in complex action spaces. It identifies three general design principles that transfer from PG to value-based learning (action-space decomposition, target-distribution modeling, and gradient-friendly value-function parameterization) and uses them to build QMLE (Q-learning with Maximum Likelihood Estimation), a PG-free Q-learning framework for continuous control. The results suggest that PG's empirical advantages stem from these scalable engineering principles rather than from inherent paradigm superiority. On the DeepMind Control Suite, QMLE matches or exceeds state-of-the-art PG methods (e.g., DMPO, D4PG) while maintaining comparable computational overhead, establishing a principled paradigm for value-based learning in high-dimensional, continuous action spaces.

📝 Abstract
Conventional wisdom suggests that policy gradient methods are better suited to complex action spaces than action-value methods. However, foundational studies have shown equivalences between these paradigms in small and finite action spaces (O'Donoghue et al., 2017; Schulman et al., 2017a). This raises the question of why their computational applicability and performance diverge as the complexity of the action space increases. We hypothesize that the apparent superiority of policy gradients in such settings stems not from intrinsic qualities of the paradigm, but from universal principles that can also be applied to action-value methods to serve similar functionality. We identify three such principles and provide a framework for incorporating them into action-value methods. To support our hypothesis, we instantiate this framework in what we term QMLE, for Q-learning with maximum likelihood estimation. Our results show that QMLE can be applied to complex action spaces with a controllable computational cost that is comparable to that of policy gradient methods, all without using policy gradients. Furthermore, QMLE demonstrates strong performance on the DeepMind Control Suite, even when compared to state-of-the-art methods such as DMPO and D4PG.
Problem

Research questions and friction points this paper is trying to address.

Why do policy gradient and action-value methods diverge in computational applicability and performance as action-space complexity increases?
Which principles behind the scalability of policy gradients can be transferred to action-value methods?
Can a value-based method match policy gradient performance in continuous control without using policy gradients?
Innovation

Methods, ideas, or system contributions that make the work stand out.

QMLE: Q-learning with maximum likelihood estimation for continuous control
Framework for transferring three design principles (action-space decomposition, target-distribution modeling, gradient-friendly value-function parameterization) to action-value methods
Controllable computational cost in complex action spaces without policy gradients
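The core idea behind the listed innovations, selecting actions against a critic without policy gradients, can be illustrated with a minimal sketch: a Gaussian proposal over actions is refit by maximum likelihood to the highest-Q samples each iteration. The toy `q_value` critic and the CEM-style refit below are illustrative assumptions, not the paper's exact QMLE procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_value(state, action):
    # Toy critic with its maximum at action = -state; a hypothetical
    # stand-in for a learned Q-function.
    return -np.sum((action + state) ** 2, axis=-1)

def select_action(state, n_samples=256, n_elite=32, iters=5):
    """Gradient-free action selection over a continuous action space.

    Each iteration refits a Gaussian proposal by maximum likelihood to
    the highest-Q samples (a CEM-style illustration of fitting a target
    distribution without policy gradients).
    """
    mu, sigma = np.zeros_like(state), np.ones_like(state)
    for _ in range(iters):
        actions = rng.normal(mu, sigma, size=(n_samples, state.shape[0]))
        elite = actions[np.argsort(q_value(state, actions))[-n_elite:]]
        # Closed-form MLE for a diagonal Gaussian on the elite set:
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu

state = np.array([0.5, -1.0])
action = select_action(state)  # converges toward -state
```

Refitting the proposal by maximum likelihood keeps the per-step cost controllable (a fixed number of critic evaluations), which is the property the abstract highlights as making value-based learning competitive with policy gradients.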