OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation

📅 2026-06-04
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Traditional policy gradient methods optimize only the expected return, which often fails to meet practical requirements on the distribution of returns, such as tail risk, robustness, or discovery of high-performing outliers. This work proposes OrderGrad, a general policy gradient estimator based on order statistics that extends the objective to a weighted average of sorted rewards (an L-statistic), thereby unifying optimization targets including Value-at-Risk (VaR), Conditional Value-at-Risk (CVaR), trimmed means, medians, and top-m averages. For any fixed sample size and any rank-weight vector, it provides an unbiased gradient estimator, available in both likelihood-ratio and reparameterization forms, that plugs into existing policy gradient frameworks as a reward transformation. The paper evaluates OrderGrad on tasks where the deployment objective diverges from mean-return maximization, including mathematical post-training of large language models.
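For orientation, the finite-sample L-statistic objective described above can be written out as follows. This is a sketch using standard order-statistic notation; the symbols n, w, R_(i), and J_w are assumptions made here for illustration and are not taken from the paper.

```latex
% n i.i.d. returns R_1, ..., R_n sampled under policy \pi_\theta,
% sorted as R_{(1)} \le R_{(2)} \le \dots \le R_{(n)},
% with a fixed rank-weight vector w = (w_1, ..., w_n).
J_w(\theta) \;=\; \mathbb{E}_{R_1,\dots,R_n \sim \pi_\theta}
  \left[ \sum_{i=1}^{n} w_i \, R_{(i)} \right]

% Example weight choices (illustrative):
%   mean:                 w_i = 1/n                      for all i
%   median (n odd):       w_{(n+1)/2} = 1,               all other w_i = 0
%   single quantile:      w_k = 1,                       all other w_i = 0   (VaR-style)
%   lower-tail average:   w_i = 1/k   for i \le k,       else 0              (CVaR-style)
%   trimmed mean:         w_i = 1/(n-2k) for k < i \le n-k, else 0
%   top-m average:        w_i = 1/m   for i > n-m,       else 0
%   best-of-n:            w_n = 1,                       all other w_i = 0
```

Setting every weight to 1/n recovers the usual expected-return objective, which is why mean optimization appears as one special case among the others.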
πŸ“ Abstract
Policy-gradient methods usually optimize expected return, but many real-world applications care about distributional properties of returns: tail risk, outlier robustness, or best-of-K discovery. We introduce OrderGrad, a family of likelihood-ratio and reparameterization gradient estimators for order-statistic objectives. OrderGrad optimizes finite-sample L-statistics, i.e., weighted averages of sorted rewards or costs, recovering objectives such as VaR, CVaR, trimmed means, medians, and top-m/best-of-K criteria by changing only the rank weights. For any fixed sample size and rank-weight vector, OrderGrad provides an unbiased gradient estimator for the corresponding order-statistic objective. The method is implemented as a simple reward transformation that can then be used in an otherwise standard policy-gradient or reparameterized update. We study the resulting estimator's variance behavior and evaluate it on tasks where mean optimization is mismatched to the deployment objective, including LLM math post-training and other tasks. OrderGrad provides a unified, plug-and-play route to risk-averse, robust, and exploratory learning. Code: https://github.com/paavo5/ordergrad
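To make "changing only the rank weights" concrete, here is a minimal sketch that evaluates an L-statistic on a single batch of sampled rewards. It illustrates the objective only and is not OrderGrad's gradient estimator or its reward transformation, which this page does not spell out. All function names (`l_statistic`, `top_m_weights`, and so on) are hypothetical helpers and do not come from the linked repository.

```python
from typing import List, Sequence


def l_statistic(rewards: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of sorted rewards: sum_i weights[i] * sorted(rewards)[i]."""
    if len(rewards) != len(weights):
        raise ValueError("rewards and weights must have the same length")
    ordered = sorted(rewards)  # ascending: worst reward first, best reward last
    return sum(w * r for w, r in zip(weights, ordered))


def mean_weights(n: int) -> List[float]:
    """Uniform weights: recovers the ordinary sample mean."""
    return [1.0 / n] * n


def median_weights(n: int) -> List[float]:
    """All mass on the middle rank (split over the two middle ranks if n is even)."""
    w = [0.0] * n
    if n % 2 == 1:
        w[n // 2] = 1.0
    else:
        w[n // 2 - 1] = w[n // 2] = 0.5
    return w


def top_m_weights(n: int, m: int) -> List[float]:
    """Average of the m best rewards; m = 1 gives a best-of-n criterion."""
    return [0.0] * (n - m) + [1.0 / m] * m


def lower_tail_weights(n: int, k: int) -> List[float]:
    """Average of the k worst rewards (an empirical CVaR-style criterion)."""
    return [1.0 / k] * k + [0.0] * (n - k)


def trimmed_mean_weights(n: int, k: int) -> List[float]:
    """Mean after discarding the k lowest and k highest rewards (needs n > 2k)."""
    keep = n - 2 * k
    return [0.0] * k + [1.0 / keep] * keep + [0.0] * k


if __name__ == "__main__":
    rewards = [3.0, -10.0, 1.0, 2.0, 4.0]  # one batch with a bad outlier
    n = len(rewards)
    for name, w in [
        ("mean", mean_weights(n)),
        ("median", median_weights(n)),
        ("top-2 average", top_m_weights(n, 2)),
        ("worst-2 average", lower_tail_weights(n, 2)),
        ("trimmed mean (k=1)", trimmed_mean_weights(n, 1)),
    ]:
        print(f"{name}: {l_statistic(rewards, w):.3f}")
```

On this batch the single outlier pulls the mean down to about 0, while the median and the trimmed mean both stay near 2. That gap is the kind of mismatch between mean optimization and a robust or risk-sensitive deployment objective that the abstract describes.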
Problem

Research questions and friction points this paper is trying to address.

order-statistic
policy gradient
distributional objectives
risk-averse learning
robust optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

OrderGrad
order-statistic
policy gradient
L-statistics
risk-averse learning