🤖 AI Summary
This study addresses high-frequency directional trading in limit order book environments by combining an order-flow-based state representation with policy-gradient reinforcement learning. Building on the Proximal Policy Optimization (PPO) algorithm, the approach evaluates the DeepSeekMath-inspired variants GRPO and GSPO, which use group-normalized updates and downside-aware reward shaping. A simplified backtesting environment based on spread-scaled rewards is used for evaluation. On AMZN, AAPL, and GOOG, the policy-based methods improve average net PnL, profitability, and drawdown relative to a tabular Q-learning baseline. The work applies DeepSeekMath-inspired policy optimization techniques to high-frequency trading strategies.
📝 Abstract
This paper studies reinforcement learning for high-frequency trading on limit order books by pairing an order-flow-based state model with policy-gradient methods. Instead of value-based RL techniques such as tabular Q-learning, our approach uses policy-based methods: vanilla PPO and the DeepSeekMath-inspired variants GRPO and GSPO, which use group-normalized updates and downside-aware reward shaping. In backtests on AMZN, AAPL, and GOOG under a simplified setup based on spread-scaled rewards, these policies improve average net PnL, profitability, and drawdown relative to the Q-learning baseline. Our results show that (1) order-flow signals are an adequate state representation for policy-based RL and (2) group-aware PPO surrogates are preferable to value-based baselines.
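To make "group-normalized updates" concrete, the following is a minimal sketch of the group-relative advantage used in GRPO-style methods: each rollout's reward is standardized against the other rollouts in its group, instead of being compared with a learned value baseline. The abstract does not specify the downside-aware shaping rule, so `downside_shaped`, its `penalty` parameter, and the sample PnL values are hypothetical and serve only as illustration.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero-variance group
    return [(r - mu) / (sigma + eps) for r in rewards]


def downside_shaped(reward, penalty=0.5):
    """Hypothetical shaping rule (an assumption, not the paper's):
    losses are scaled by (1 + penalty); gains pass through unchanged."""
    return reward if reward >= 0 else reward * (1.0 + penalty)


# Hypothetical net PnL of four rollouts sampled from the same starting state.
group_pnl = [2.0, -1.0, 0.5, -0.5]
shaped = [downside_shaped(r) for r in group_pnl]
advantages = group_normalized_advantages(shaped)
```

Because the advantages are centered within the group, they sum to approximately zero: rollouts that beat their peers are reinforced and the rest are discouraged, without training a separate critic.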