🤖 AI Summary
Optimizing the Sharpe ratio—the ratio of expected reward to its standard deviation—in Markov decision processes (MDPs) is challenging due to its fractional structure, which violates the principle of optimality required by dynamic programming, and due to the inapplicability of conventional risk-sensitive methods. To address this, we propose the first Dinkelbach-based iterative policy optimization framework, which equivalently reformulates the original problem into a sequence of mean-squared-variance (M2V) subproblems. Our method integrates policy iteration with adaptive updates of a risk-sensitivity parameter, giving a unified treatment of both the average- and discounted-reward settings in infinite-horizon MDPs. We establish theoretical guarantees of monotone convergence to the globally optimal Sharpe ratio. Numerical experiments demonstrate superior performance in highly risk-sensitive environments. This work pioneers the systematic integration of dynamic programming principles into fractional risk-objective optimization, delivering a rigorous, efficient, and scalable paradigm for Sharpe-ratio MDPs.
📝 Abstract
The Sharpe ratio (also known as the reward-to-variability ratio) is a widely used metric in finance, which measures the additional return gained per unit of increased risk (standard deviation of return). However, optimizing the Sharpe ratio in Markov decision processes (MDPs) is challenging, because two difficulties hinder the application of dynamic programming. One is that dynamic programming does not work for fractional objectives, and the other is that dynamic programming is invalid for risk metrics. In this paper, we study Sharpe ratio optimization in infinite-horizon MDPs, considering both the long-run average and discounted settings. We address the first challenge with Dinkelbach's transform, which converts the Sharpe ratio objective into a mean-squared-variance (M2V) objective. It is shown that the M2V optimization and the original Sharpe ratio optimization share the same optimal policy when the risk-sensitive parameter equals the optimal Sharpe ratio. For the second challenge, we develop an iterative algorithm to solve the M2V optimization, which is similar to a mean-variance optimization in MDPs. We iteratively solve the M2V problem and obtain the associated Sharpe ratio, which is used to update the risk-sensitive parameter in the next M2V iteration. We show that the resulting sequence of Sharpe ratios is monotonically increasing and converges to the optimal Sharpe ratio. For both the average and discounted MDP settings, we develop a policy iteration procedure and prove its convergence to the optimum. Numerical experiments are conducted for validation. To the best of our knowledge, our approach is the first to solve Sharpe ratio optimization in MDPs with dynamic-programming-type algorithms. We believe the proposed algorithm can shed light on solving MDPs with other fractional objectives.
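The abstract's iterative scheme—solve a transformed subproblem, then update the risk-sensitivity parameter with the resulting Sharpe ratio—follows the classic Dinkelbach pattern for fractional programming. The sketch below illustrates that pattern on a toy problem where each "policy" is abstracted to a `(mean, std)` pair of its return distribution; it is not the paper's MDP policy iteration (no M2V subproblem or dynamic programming is solved here), and it uses the direct transform `mean - lam * std` rather than the paper's M2V objective. All names and numbers are illustrative assumptions.

```python
# Toy sketch of a Dinkelbach-style iteration for maximizing the
# Sharpe ratio mean/std over a finite set of candidate "policies".
# Each policy is reduced to a (mean, std) pair; in the paper this
# inner step would instead be an M2V optimization over an MDP.
policies = [(1.0, 2.0), (0.8, 1.0), (0.5, 0.4), (0.3, 0.5)]  # made-up data

def dinkelbach_sharpe(policies, tol=1e-9, max_iter=100):
    lam = 0.0  # risk-sensitivity parameter, updated each iteration
    for _ in range(max_iter):
        # Inner step: maximize the transformed objective mean - lam * std
        # (stand-in for solving the paper's M2V subproblem).
        mean, std = max(policies, key=lambda p: p[0] - lam * p[1])
        new_lam = mean / std  # Sharpe ratio of the current maximizer
        if abs(new_lam - lam) < tol:
            break  # fixed point reached: lam is the optimal Sharpe ratio
        lam = new_lam  # updates are monotonically non-decreasing
    return lam

best_sharpe = dinkelbach_sharpe(policies)
```

Under this transform, the sequence of `lam` values increases monotonically to the best attainable Sharpe ratio among the candidates, mirroring the convergence behavior the abstract claims for the MDP setting.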