AI Summary
This work addresses the vulnerability of off-policy maximum-entropy reinforcement learning methods, such as Soft Actor-Critic (SAC), to noisy state observations in low signal-to-noise ratio financial environments. Such noise biases Q-value estimates, and bootstrapping amplifies the resulting errors, a failure mode the authors call the "Financial Entropy Trap." To mitigate this issue, the authors propose placing a compact, bounded parameterized quantum circuit (PQC) at the front end of both the policy and value networks. The PQC constrains feature propagation at the representation level, stabilizing Bellman target estimation at its source, while trainable quantum entanglement preserves cross-asset interactions. This plug-and-play design avoids the limitations of post-hoc Q-value regularization and raw-input filtering. Empirical results on real-world portfolio management tasks show a 66.89% relative gain in cumulative return over standard SAC and a gain of approximately 27% over the best continuous-control deep reinforcement learning baseline, with improved out-of-sample stability.
Abstract
The financial market is a typical low signal-to-noise ratio (SNR) setting, which often destabilizes off-policy maximum-entropy methods like Soft Actor-Critic (SAC). Specifically, noisy state representations may produce unreliable Q-value estimates, and bootstrapping amplifies these errors, forming a failure mode we call the "Financial Entropy Trap". In this paper, we propose FPQC-SAC, an efficient and plug-and-play SAC variant that places a compact and bounded Parameterized Quantum Circuit (PQC) before the actor and critic networks to constrain feature propagation at the representation level, rather than filtering raw inputs or regularizing Q-values after bootstrapping. Notably, FPQC-SAC reduces the impact of extreme market fluctuations on Bellman target estimation, while trainable quantum entanglement preserves flexible cross-asset interactions. Empirical evaluations on real-world portfolio management tasks demonstrate that FPQC-SAC substantially enhances out-of-sample stability and cumulative returns, achieving a 66.89% relative gain in cumulative return over standard unconstrained SAC and outperforming the best continuous-control deep reinforcement learning baseline by approximately 27%. Open-source code is available at https://github.com/ZeyuLIU-UST/FPQC-SAC-main.
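The idea of a bounded PQC front end can be illustrated with a small NumPy statevector simulation. This is a minimal sketch under stated assumptions, not the authors' implementation (the linked repository is the authoritative source). The circuit layout here is assumed for illustration: arctan-squashed RY angle encoding, a ring of controlled-RY gates with trainable angles standing in for "trainable entanglement", a final trainable RY layer, and Pauli-Z expectation values as output features. All function names (`ry`, `apply_gate`, `apply_controlled`, `pqc_features`) are hypothetical.

```python
import numpy as np

def ry(theta):
    """Real 2x2 matrix of the single-qubit RY(theta) rotation."""
    c, s = np.cos(theta / 2.0), np.sin(theta / 2.0)
    return np.array([[c, -s], [s, c]])

def apply_gate(state, gate, q):
    """Apply a one-qubit gate to axis q of a state tensor of shape (2,)*n."""
    return np.moveaxis(np.tensordot(gate, state, axes=([1], [q])), 0, q)

def apply_controlled(state, gate, control, target):
    """Apply `gate` to `target` only on the branch where `control` is |1>."""
    out = state.copy()
    idx = [slice(None)] * state.ndim
    idx[control] = 1
    branch = out[tuple(idx)]  # the control axis is dropped in this slice
    t = target - 1 if target > control else target
    out[tuple(idx)] = apply_gate(branch, gate, t)
    return out

def pqc_features(x, ent_theta, rot_theta):
    """Map n raw features to n values in [-1, 1] (assumed layout; needs n >= 2)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    state = np.zeros((2,) * n)
    state[(0,) * n] = 1.0  # start in |0...0>
    # Assumed encoding: squash each input into a rotation angle in (-pi, pi).
    for q in range(n):
        state = apply_gate(state, ry(2.0 * np.arctan(x[q])), q)
    # Assumed entangler: ring of controlled-RY gates with trainable angles.
    for q in range(n):
        state = apply_controlled(state, ry(ent_theta[q]), q, (q + 1) % n)
    # Trainable single-qubit rotations.
    for q in range(n):
        state = apply_gate(state, ry(rot_theta[q]), q)
    # Amplitudes stay real with RY-only gates, so probabilities are squares.
    probs = state ** 2
    return np.array([
        probs.take(0, axis=q).sum() - probs.take(1, axis=q).sum()
        for q in range(n)
    ])

# Extreme inputs (e.g. a market shock) still yield features inside [-1, 1].
rng = np.random.default_rng(0)
x = np.array([1e6, -1e6, 0.0, 3.0])
feats = pqc_features(x, rng.uniform(-np.pi, np.pi, 4), rng.uniform(-np.pi, np.pi, 4))
print(feats)
```

Because every output is an expectation value of a Pauli-Z observable, it lies in [-1, 1] regardless of input magnitude, which is one way to read "bounded" in the abstract. In an SAC setting these features would then be passed to the actor and critic networks in place of the raw state; how the paper wires this up in detail is not stated in the abstract.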