🤖 AI Summary
In continuous control tasks, Actor-Critic methods often suffer from high-frequency policy oscillations due to instability in the Q-function gradient field, hindering deployment on physical systems. This work reveals, for the first time, that policy non-smoothness arises from the ratio between the Q-function's mixed partial derivatives and its curvature over the action space. To address this, the authors propose the PAVE framework, which leverages implicit differentiation to analyze the Q-gradient field and introduces a value-field-based regularization on the Critic side. This regularization minimizes gradient fluctuations while preserving local curvature information. Notably, PAVE achieves policy smoothness and robustness comparable to actor-side regularization, without modifying the Actor, and maintains strong task performance.
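The ratio described above can be sketched via the implicit function theorem; the notation below is an illustrative reconstruction, not taken from the paper:

```latex
% First-order optimality of the greedy action $a^*(s)$:
\nabla_a Q\bigl(s, a^*(s)\bigr) = 0.
% Implicit differentiation with respect to the state gives
\frac{\partial a^*}{\partial s}
  = -\bigl[\nabla^2_{aa} Q\bigr]^{-1} \nabla^2_{as} Q,
% so the policy's state sensitivity is bounded by
\left\lVert \frac{\partial a^*}{\partial s} \right\rVert
  \le \frac{\lVert \nabla^2_{as} Q \rVert}
           {\lambda_{\min}\!\bigl(-\nabla^2_{aa} Q\bigr)},
```

i.e. the magnitude of the mixed partials (how strongly state noise perturbs the action gradient) divided by the Q-function's curvature in action space (how sharply distinct actions are scored).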
📝 Abstract
Policies learned via continuous actor-critic methods often exhibit erratic, high-frequency oscillations, making them unsuitable for physical deployment. Current approaches attempt to enforce smoothness by directly regularizing the policy's output. We argue that this treats the symptom rather than the cause. In this work, we theoretically establish that policy non-smoothness is fundamentally governed by the differential geometry of the critic. By applying implicit differentiation to the actor-critic objective, we prove that the sensitivity of the optimal policy is bounded by the ratio of the Q-function's mixed-partial derivative (noise sensitivity) to its action-space curvature (signal distinctness). To empirically validate this theoretical insight, we introduce PAVE (Policy-Aware Value-field Equalization), a critic-centric regularization framework that treats the critic as a scalar field and stabilizes its induced action-gradient field. PAVE rectifies the learning signal by minimizing Q-gradient volatility while preserving local curvature. Experimental results demonstrate that PAVE achieves smoothness and robustness comparable to policy-side regularization methods, while maintaining competitive task performance, without modifying the actor.
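The abstract does not give PAVE's loss in closed form, but a generic critic-side penalty on Q-gradient volatility can be sketched as follows. This is a minimal illustration under assumed names (`q_net`, `gradient_volatility_penalty`) and an assumed perturbation scheme, not the paper's actual implementation; in particular it omits the curvature-preserving component of the method.

```python
import torch

# Hypothetical tiny critic Q(s, a); architecture and dimensions are illustrative.
torch.manual_seed(0)
q_net = torch.nn.Sequential(
    torch.nn.Linear(4 + 2, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1)
)

def action_gradient(state, action):
    """Return dQ/da, treating the critic as a scalar field over actions."""
    action = action.detach().requires_grad_(True)
    q = q_net(torch.cat([state, action], dim=-1)).sum()
    # create_graph=True keeps the graph so the penalty itself is differentiable
    # with respect to the critic's parameters.
    return torch.autograd.grad(q, action, create_graph=True)[0]

def gradient_volatility_penalty(state, action, eps=0.05):
    """Penalize fluctuation of the action-gradient field under small
    action perturbations (a sketch of critic-side gradient smoothing)."""
    g = action_gradient(state, action)
    g_perturbed = action_gradient(state, action + eps * torch.randn_like(action))
    return (g - g_perturbed).pow(2).sum(dim=-1).mean()

state = torch.randn(8, 4)
action = torch.randn(8, 2)
penalty = gradient_volatility_penalty(state, action)
# The penalty would be added to the usual critic objective, e.g.
# total_loss = td_loss + lam * penalty, leaving the actor update unchanged.
```

Because the penalty is applied only to the critic, the actor continues to follow deterministic policy gradients through an already-smoothed value field, which is the critic-centric design the abstract describes.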