arxiv: 2606.06096 · v1 · pith:DMYZIDMHnew · submitted 2026-06-04 · 💻 cs.LG · cs.AI · cs.CL

OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation

Pith reviewed 2026-06-28 02:15 UTC · model grok-4.3

classification 💻 cs.LG cs.AI cs.CL
keywords policy gradient · order statistics · L-statistics · risk-sensitive reinforcement learning · value at risk · CVaR · reinforcement learning

The pith

OrderGrad supplies unbiased gradient estimates for any fixed-sample order-statistic objective by a simple reward transformation before a standard policy-gradient step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Policy-gradient methods usually optimize expected return, yet many applications require optimizing other properties of the return distribution such as tail risk, robustness to outliers, or the best outcome among K trials. OrderGrad derives likelihood-ratio and reparameterization estimators that target finite-sample L-statistics, which are weighted averages of sorted rewards. For any fixed batch size and any fixed vector of rank weights, the resulting estimator is unbiased for the gradient of the chosen order-statistic objective. The method works by replacing each observed reward with a rank-dependent transformed value and then feeding the transformed values into an otherwise unchanged policy-gradient update. This single change recovers objectives such as VaR, CVaR, trimmed means, medians, and best-of-K criteria.

Core claim

For any fixed sample size and any fixed rank-weight vector, OrderGrad yields an unbiased gradient estimator of the corresponding finite-sample L-statistic objective; the estimator is realized simply by transforming each reward according to its rank within the batch before applying a standard policy-gradient or reparameterized update.

What carries the argument

The finite-sample L-statistic defined by a fixed rank-weight vector applied to a fixed number of sorted samples; the gradient estimator is obtained by weighting each sample's contribution by its rank-dependent transformed value.
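
To make the rank-weight idea concrete, here is a minimal sketch of a finite-sample L-statistic and a few weight vectors. The function names are hypothetical, and the CVaR-style convention (averaging the worst ceil(alpha * n) rewards) is an assumption of this sketch, not the paper's definition. The max-of-k weights use the standard combinatorial fact that the m-th smallest of n values is the maximum of a uniformly random k-subset with probability C(m - 1, k - 1) / C(n, k).

```python
from math import comb

import numpy as np


def l_statistic(rewards, weights):
    """Weighted sum of the rewards after sorting them in ascending order."""
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if rewards.shape != weights.shape:
        raise ValueError("need exactly one weight per sorted reward")
    return float(np.dot(weights, np.sort(rewards)))


def mean_weights(n):
    """Uniform weights: the L-statistic reduces to the ordinary mean."""
    return np.full(n, 1.0 / n)


def bottom_fraction_weights(n, alpha):
    """CVaR-style weights (assumed convention): mean of the worst ceil(alpha * n)."""
    m = max(1, int(np.ceil(alpha * n)))
    w = np.zeros(n)
    w[:m] = 1.0 / m
    return w


def max_of_k_weights(n, k):
    """Weights giving the expected maximum of a random k-subset of n values."""
    # Python integer division keeps large binomials exact before rounding.
    return np.array([comb(m - 1, k - 1) / comb(n, k) for m in range(1, n + 1)])
```

With k = 1 the max-of-k weights are uniform, and with k = n all mass sits on the largest sorted value, which agrees with the behavior described in the Figure 8 caption.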

If this is right

  • Any existing policy-gradient or reparameterized algorithm can optimize VaR, CVaR, trimmed means, or best-of-K criteria after only a reward transformation.
  • The same estimator applies unchanged to both on-policy likelihood-ratio and off-policy or reparameterized settings.
  • Variance of the estimator can be controlled by choice of rank weights without altering the underlying optimizer.
  • Tasks whose deployment objective differs from mean return, such as LLM math post-training, become directly addressable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If a bias-correction term could be derived, the rank weights might be allowed to adapt to the data without losing unbiasedness.
  • The batch-sorting construction may extend to continuous or infinite-horizon settings by replacing exact order statistics with suitable approximations.
  • The same reward-transformation idea could be applied outside reinforcement learning to any gradient-based optimizer whose loss is an order statistic.

Load-bearing premise

The number of samples used to form the order statistics must stay fixed and the rank weights must be chosen independently of the realized sample values.

What would settle it

For a simple differentiable policy and a known order-statistic objective, compute the true gradient analytically and compare it to the Monte-Carlo average of OrderGrad estimates over many independent batches of fixed size; any nonzero bias would falsify the claim.
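
The check described above can be sketched in a few lines. This is an illustrative experiment, not the paper's code: it borrows the toy setting from the Figure 3 caption (x ~ N(µ, 1), R(x) = −x²), and it uses the plain likelihood-ratio identity and a pathwise (reparameterized) gradient through the sort rather than the paper's reward transformation, which is not spelled out on this page. The function name and defaults are hypothetical.

```python
import numpy as np


def estimate_gradients(weights, mu=0.5, num_batches=400_000, seed=0):
    """Monte Carlo LR and RP estimates of d/dmu E[sum_k w_k R_(k)].

    Assumed toy setting: x ~ N(mu, 1) and R(x) = -x**2, batch size len(weights).
    """
    weights = np.asarray(weights, dtype=float)
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((num_batches, weights.size))
    x = mu + eps                                   # reparameterized samples
    rewards = -x**2
    order = np.argsort(rewards, axis=1)            # ascending rank per batch
    sorted_rewards = np.take_along_axis(rewards, order, axis=1)
    batch_value = sorted_rewards @ weights         # L-statistic of each batch

    # Likelihood ratio: L * sum_i d/dmu log N(x_i; mu, 1) = L * sum_i (x_i - mu).
    lr = batch_value * eps.sum(axis=1)

    # Reparameterization: dR/dmu = -2 x, weighted by the rank of each sample.
    sorted_x = np.take_along_axis(x, order, axis=1)
    rp = (-2.0 * sorted_x) @ weights

    return float(lr.mean()), float(rp.mean())
```

For uniform weights the objective is E[−x²] = −(µ² + 1), so both averages should approach −2µ. For other fixed weight vectors no closed form is assumed here, but the two estimators should agree with each other up to Monte Carlo error; a persistent gap would indicate bias in one of them.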

Figures

Figures reproduced from arXiv: 2606.06096 by Kohsei Matsutani, Paavo Parmas, Shota Takashiro, Soichiro Nishimori, Takeshi Kojima, Yongmin Kim, Yusuke Iwasawa, Yutaka Matsuo.

Figure 1: OrderGrad overview. Rank weights define a distributional objective over sorted rewards.
Figure 2: Diagnostic computation experiment. The panels visualize several rank-weight choices for
Figure 3: Diagnostic gradient experiments. For $x \sim \mathcal{N}(\mu = 0.5, 1)$ and $R(x) = -x^2$, we compare LR and RP estimates to the exact 20% CVaR gradient and study the SNR of Top-M@K estimators. … panel reports empirical bias against k, comparing the LR and RP estimates to the exact gradient. Increasing k increases variance, but it also reduces bias relative to the exact CVaR gradient. Thus larger k approximates the target ob…
Figure 4: Task-average pass@k (k ≤ 256). Unweighted average over AIME24, AIME25, AMC23, MATH500, and Minerva (temperature 0.6, top-p 0.95, n = 1024 per problem). Our method with Top2@4 outperforms GRPO at large k and outperforms MaxPO (K = 4) overall on pass@k. … report the unbiased pass@k [18] metric for $k \in \{1, 2, 4, 8, \dots\}$, computed as $\mathrm{pass@}k := \mathbb{E}_{x \sim \mathcal{D}}\left[1 - \binom{n-c}{k} \big/ \binom{n}{k}\right]$ (23), where n is the number of sampled c…
Figure 5: Effective size of m and K on Qwen3-4B-Base. [Plot residue: panel (a) "Response length distribution by correct and incorrect subsets on Minerva" for Qwen2.5-Math-7B; axes: response length (tokens) vs. density; legend: Correct, Incorrect, Base, Ours (Top m = 2, K = 4), Ours (Top m = 2, Bottom m = 2, K = 4); a further panel has x-axis k (number of samples).]
Figure 6: Results for Multi-Reward Objectives with Correctness Reward and Length Penalty (temper
Figure 7: Quantile weight profiles for q ∈ {0.03, 0.06, 0.10, 0.25, 0.5} with N = 400 and comparison size k = 100. Smaller quantiles place mass on the lower tail of the sorted batch, while the median profile is centered near m = N/2.
Figure 8: ReMax, or maximum-of-k, weight profiles for N = 8 and k ∈ {1, . . . , 7}. The k = 1 curve is uniform, corresponding to the ordinary mean, while larger k increasingly concentrates weight on the largest sorted values.
Figure 9: Rank-weight profiles for a comparison batch of size
Figure 10: TopM profiles for several choices of M and k with N = 100. These schemes interpolate between a strongly top-focused objective and the ordinary mean: when M = k, all ranks are averaged and the resulting profile is uniform.
Figure 11: Tail- and quantile-focused schemes for N = 100 and k = 20. TopM emphasizes high values, BotM emphasizes low values, TopBot places mass on both tails, and the quantile scheme concentrates around the specified lower quantile.
Figure 12: Robust and signed schemes for N = 100 and k = 20. The median focuses on the center, the trimmed and winsorized means reduce sensitivity to extremes, and the Gini mean difference uses signed weights to contrast the upper and lower tails.
Figure 13: High-yield tail-risk trading example. Panel (a) reports bad deployment probabilities: losing
Figure 14: Representative robust-regression fit panel.
Figure 15: Per-task pass@k on Qwen2.5-Math-7B.
Figure 16: Per-task pass@k on Qwen3-4B-Base
Figure 17: Per-benchmark pass@k curves on Qwen2.5-Math-7B with response length penalty. Top m = 2 by correctness reward; Bottom m = 2 by response-length reward.
Figure 18: Aggregate MinAtar performance. We compare OrderGrad PPO with
Figure 19: Effect of M on MinAtar without entropy regularization. We report aggregate normalized evaluation return across games. All OrderGrad curves use entropy coefficient 0.0. The best performance occurs around M = 9. … of two hidden layers with ReLU and Tanh activations and outputs action logits. For OrderGrad PPO and PPO-Q, the critic head consists of two hidden layers and outputs per-action Q-values, yielding a…
Figure 20: Policy entropy under different values of
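
The pass@k formula quoted in the Figure 4 caption can be sketched as follows. The caption is cut off before it defines c; this sketch assumes the usual convention from reference [18], where n is the number of samples drawn for a problem and c is the number of correct ones. Function names are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Per-problem unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples drawn for the problem; c: correct samples (assumed meaning).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1
    # whenever every size-k subset must contain a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems, given one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, with n = 4 samples of which c = 1 is correct, three of the six size-2 subsets miss the correct sample, so pass_at_k(4, 1, 2) evaluates to 0.5.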
Original abstract

Policy-gradient methods usually optimize expected return, but many real-world applications care about distributional properties of returns: tail risk, outlier robustness, or best-of-K discovery. We introduce OrderGrad, a family of likelihood-ratio and reparameterization gradient estimators for order-statistic objectives. OrderGrad optimizes finite-sample L-statistics, i.e., weighted averages of sorted rewards or costs, recovering objectives such as VaR, CVaR, trimmed means, medians, and top-m/best-of-K criteria by changing only the rank weights. For any fixed sample size and rank-weight vector, OrderGrad provides an unbiased gradient estimator for the corresponding order-statistic objective. The method is implemented as a simple reward transformation that can then be used in an otherwise standard policy-gradient or reparameterized update. We study the resulting estimator's variance behavior and evaluate it on tasks where mean optimization is mismatched to the deployment objective, including LLM math post-training and other tasks. OrderGrad provides a unified, plug-and-play route to risk-averse, robust, and exploratory learning. Code: https://github.com/paavo5/ordergrad

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces OrderGrad, a family of likelihood-ratio and reparameterization gradient estimators for policy optimization of finite-sample order-statistic (L-statistic) objectives. For any fixed sample size N and fixed rank-weight vector w independent of the data, it claims the resulting estimators are unbiased for the corresponding weighted sum of order statistics, recovering objectives such as VaR, CVaR, trimmed means, medians, and top-m criteria via a simple reward transformation. The work includes variance analysis and empirical evaluation on tasks where mean optimization is mismatched to the deployment goal, including LLM math post-training.

Significance. If the unbiasedness result holds under the stated conditions, OrderGrad supplies a unified, plug-and-play route to optimizing non-mean objectives in reinforcement learning and policy gradients. This is significant for risk-averse, robust, and exploratory learning settings. The open-source code link is a positive contribution that supports reproducibility.

minor comments (2)
  1. The variance analysis mentioned in the abstract would benefit from a dedicated subsection with explicit variance expressions or bounds to make the estimator's behavior easier to compare with standard policy gradients.
  2. Figure captions and axis labels in the empirical section should explicitly state the sample size N and weight vector w used in each experiment to allow direct verification of the fixed-N, fixed-w condition.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of OrderGrad and the recommendation for minor revision. The provided summary accurately reflects the paper's focus on unbiased likelihood-ratio and reparameterization estimators for finite-sample L-statistic objectives via rank-based reward transformations.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's core claim is that OrderGrad yields an unbiased gradient estimator for any fixed N and fixed rank-weight vector w by applying the likelihood-ratio identity (or reparameterization) to the L-statistic $L = \sum_{k=1}^{N} w_k R_{(k)}$. This is a direct, standard extension of the policy-gradient identity to a well-defined functional of the N i.i.d. samples; the unbiasedness holds by construction of the LR trick once N and w are held constant and independent of the data. No self-citation chain, fitted parameter renamed as prediction, or self-definitional step appears in the derivation. The assumption is stated explicitly in the claim itself, rendering the result self-contained against external benchmarks.
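
For reference, the likelihood-ratio step this rationale relies on can be written out as below. This is the textbook score-function identity applied to the whole batch, under the stated assumptions (N i.i.d. samples from the policy, weights fixed in advance); the symbols J, π_θ, and x_i are notation chosen for this sketch, and the paper's final estimator may rearrange the terms differently.

```latex
J(\theta) = \mathbb{E}_{x_1,\dots,x_N \overset{\text{i.i.d.}}{\sim} \pi_\theta}
  \Big[ \sum_{k=1}^{N} w_k R_{(k)} \Big],
\qquad
\nabla_\theta J(\theta) = \mathbb{E}
  \Big[ \Big( \sum_{k=1}^{N} w_k R_{(k)} \Big)
        \sum_{i=1}^{N} \nabla_\theta \log \pi_\theta(x_i) \Big].
```

The identity follows from differentiating the joint density $\prod_i \pi_\theta(x_i)$ under the integral. It would fail if w or N depended on θ or on the realized samples in a way not captured inside the bracket, which is why the fixed-N, fixed-w premise matters.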

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract, the method rests on standard policy-gradient assumptions; no free parameters, invented entities, or ad-hoc axioms are explicitly introduced.

axioms (1)
  • domain assumption: Standard likelihood-ratio and reparameterization gradient estimators remain valid after the order-statistic reward transformation.
    The method is described as a simple reward transformation usable in otherwise standard updates.

pith-pipeline@v0.9.1-grok · 5763 in / 1152 out tokens · 38485 ms · 2026-06-28T02:15:12.650552+00:00 · methodology

Reference graph

Works this paper leans on

118 extracted references · 17 linked inside Pith

  1. [1]

    Acerbi, C. (2002). Spectral measures of risk: A coherent representation of subjective risk aversion. Journal of Banking & Finance, 26(7):1505–1518

  2. [2]

    Acerbi, C. and Tasche, D. (2002a). Expected shortfall: A natural coherent alternative to value at risk. Economic Notes, 31(2):379–388

  3. [3]

    Acerbi, C. and Tasche, D. (2002b). On the coherence of expected shortfall. Journal of Banking & Finance, 26(7):1487–1503

  4. [4]

    Agarwal, R., Schwarzer, M., Castro, P. S., Courville, A. C., and Bellemare, M. (2021). Deep reinforcement learning at the edge of the statistical precipice. Advances in neural information processing systems, 34:29304–29320

  5. [5]

    Ahmadian, A., Cremer, C., Gallé, M., Fadaee, M., Kreutzer, J., Pietquin, O., Üstün, A., and Hooker, S. (2024). Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 12248–12267

  6. [6]

    Arnold, B. C., Balakrishnan, N., and Nagaraja, H. N. (1992). A First Course in Order Statistics. John Wiley & Sons

  7. [7]

    Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235–256

  8. [8]

    Bagirov, F., Arkhipov, M., Sycheva, K., Glukhov, E., and Bogomolov, E. (2025). The best of N worlds: Aligning reinforcement learning with best-of-N sampling via max@k optimisation. arXiv preprint arXiv:2510.23393

  9. [9]

    Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., Dhruva, T. B., Muldal, A., Heess, N., and Lillicrap, T. P. (2018). Distributed distributional deterministic policy gradients. In International Conference on Learning Representations

  10. [10]

    Bellemare, M. G., Dabney, W., and Munos, R. (2017). A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 449–458. PMLR

  11. [11]

    Bellemare, M. G., Dabney, W., and Rowland, M. (2023). Distributional Reinforcement Learning. The MIT Press, Cambridge, MA

  12. [12]

    Bickel, P. J. and Lehmann, E. L. (1975). Descriptive statistics for nonparametric models. II. location. The Annals of Statistics, 3(5):1045–1069

  13. [13]

    Bu, D., Huang, W., Han, A., Nitanda, A., Xue, B., Zhang, Q., Wong, H.-S., and Suzuki, T. (2025). Consistency is not always correct: Towards understanding the role of exploration in post-training reasoning. arXiv preprint arXiv:2511.07368

  14. [14]

    Burda, Y., Edwards, H., Storkey, A., and Klimov, O. (2018). Exploration by random network distillation. arXiv preprint arXiv:1810.12894

  15. [15]

    Cai, S., Gao, C., Zhang, Y., Shi, W., Zhang, J., Bao, K., Wang, Q., and Feng, F. (2025). K-order ranking preference optimization for large language models. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T., editors, Findings of the Association for Computational Linguistics: ACL 2025, pages 4844–4859, Vienna, Austria. Association for Computational ...

  16. [16]

    Cardoso, A. R. and Xu, H. (2019). Risk-averse stochastic convex bandit. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 39–47

  17. [17]

    Chen, F., Huang, A., Golowich, N., Malladi, S., Block, A., Ash, J. T., Krishnamurthy, A., and Foster, D. J. (2025a). The coverage principle: How pre-training enables post-training. arXiv preprint arXiv:2510.15020

  18. [18]

    Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374

  19. [19]

    Chen, Z., Qin, X., Wu, Y., Ling, Y., Ye, Q., Zhao, W. X., and Shi, G. (2025b). Pass@k training for adaptively balancing exploration and exploitation of large reasoning models. arXiv preprint arXiv:2508.10751

  20. [20]

    Cheng, D., Huang, S., Zhu, X., Dai, B., Zhao, W. X., Zhang, Z., and Wei, F. (2025). Reasoning with exploration: An entropy perspective. arXiv preprint arXiv:2506.14758

  21. [21]

    Chow, Y., Ghavamzadeh, M., Janson, L., and Pavone, M. (2018). Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research, 18(167):1–51

  22. [22]

    Chow, Y., Tamar, A., Mannor, S., and Pavone, M. (2015). Risk-sensitive and robust decision-making: A CVaR optimization approach. In Advances in Neural Information Processing Systems, volume 28, pages 1522–1530

  23. [23]

    Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. (2017). Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, volume 30

  24. [24]

    Cui, G., Zhang, Y., Chen, J., Yuan, L., Wang, Z., Zuo, Y., Li, H., Fan, Y., Chen, H., Chen, W., et al. (2025). The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617

  25. [25]

    Curi, S., Levy, K. Y., Jegelka, S., and Krause, A. (2020). Adaptive sampling for stochastic risk-averse learning. In Advances in Neural Information Processing Systems 33, pages 1036–1047

  26. [26]

    Dabney, W., Ostrovski, G., Silver, D., and Munos, R. (2018a). Implicit quantile networks for distributional reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1096–1105. PMLR

  27. [27]

    Dabney, W., Rowland, M., Bellemare, M. G., and Munos, R. (2018b). Distributional reinforcement learning with quantile regression. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pages 2892–2901. AAAI Press

  28. [28]

    Dang, X., Baek, C., Wen, K., Kolter, Z., and Raghunathan, A. (2025). Weight ensembling improves reasoning in language models. In Second Conference on Language Modeling

  29. [29]

    Daniell, P. J. (1920). Observations weighted according to order. American Journal of Mathematics, 42(4):222–236

  30. [30]

    Fan, Y., Lyu, S., Ying, Y., and Hu, B. (2017). Learning with average top-k loss. In Advances in Neural Information Processing Systems 30

  31. [31]

    Gao, J., Pan, L., Wang, Y., Zhong, R., Lu, C., Cai, Q., Jiang, P., and Zhao, X. (2025). Navigate the unknown: Enhancing LLM reasoning with intrinsic motivation guided exploration. arXiv preprint arXiv:2505.17621

  32. [32]

    Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. (2024). The Llama 3 herd of models. arXiv preprint arXiv:2407.21783

  33. [33]

    Guo, D. et al. (2025a). DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 645(8081):633–638

  34. [34]

    Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. (2025b). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948

  35. [35]

    He, A. W., Fried, D., and Welleck, S. (2025). Rewarding the unlikely: Lifting GRPO beyond distribution sharpening. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V., editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25548–25560, Suzhou, China. Association for Computational Linguistics

  36. [36]

    Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. (2021). Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874

  37. [37]

    Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. (2018). Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pages 3215–3222. AAAI Press

  38. [38]

    Holland, M. J. and Haress, E. M. (2021). Learning with risk-averse feedback under potentially heavy tails. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 892–900

  39. [39]

    Holland, M. J. and Haress, E. M. (2022). Spectral risk-based learning using unbounded losses. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 1871–1886

  40. [40]

    Holland, M. J. and Tanabe, K. (2023). A survey of learning criteria going beyond the usual risk. Journal of Artificial Intelligence Research, 78:781–821

  41. [41]

    Hu, S., Cai, X., Huang, Y., Yao, Z., Zhang, L., Zhang, P., Deng, Y., and Chen, K. (2025). Emergent slow thinking in LLMs as inverse tree freezing. arXiv preprint arXiv:2509.23629

  42. [42]

    Huber, P. J. and Ronchetti, E. M. (2009). Robust Statistics. John Wiley & Sons, 2 edition

  43. [43]

    Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al. (2024). OpenAI o1 system card. arXiv preprint arXiv:2412.16720

  44. [44]

    Jiang, Y., Li, Y., Chen, G., Liu, D., Cheng, Y., and Shao, J. (2025). Rethinking entropy regularization in large reasoning models. arXiv preprint arXiv:2509.25133

  45. [45]

    Khim, J., Leqi, L., Prasad, A., and Ravikumar, P. (2020). Uniform convergence of rank-weighted learning. In International conference on machine learning, pages 5254–5263. PMLR

  46. [46]

    Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In International Conference on Learning Representations

  47. [47]

    Koyamada, S., Okano, S., Nishimori, S., Murata, Y., Habara, K., Kita, H., and Ishii, S. (2023a). pgx: Hardware-accelerated parallel game simulators for reinforcement learning. Advances in Neural Information Processing Systems, 36:45716–45743

  48. [48]

    Koyamada, S., Parmas, P., Kozuno, T., and Ishii, S. (2023b). Emergence of exploration in policy gradient reinforcement learning via resetting. OpenReview submission to ICLR 2023. https://openreview.net/forum?id=GKsNIC_mQRG

  49. [49]

    Lambert, N., Morrison, J., Pyatkin, V., Huang, S., Ivison, H., Brahman, F., Miranda, L. J. V., Liu, A., Dziri, N., Lyu, X., Gu, Y., Malik, S., Graf, V., Hwang, J. D., Yang, J., Le Bras, R., Tafjord, O., Wilhelm, C., Soldaini, L., Smith, N. A., Wang, Y., Dasigi, P., and Hajishirzi, H. (2025). Tulu 3: Pushing frontiers in open language model post-train...

  50. [50]

    L’Ecuyer, P. (1990). A unified view of the IPA, SF, and LR gradient estimation techniques. Management Science, 36(11):1364–1383

  51. [51]

    Leqi, L., Huang, A., Lipton, Z., and Azizzadenesheli, K. (2022). Supervised learning with general risk functionals. In International Conference on Machine Learning, pages 12570–12592. PMLR

  52. [52]

    Lewkowycz, A., Andreassen, A. J., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V. V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., Wu, Y., Neyshabur, B., Gur-Ari, G., and Misra, V. (2022). Solving quantitative reasoning problems with language models. In Advances in Neural Information Processing Systems

  53. [53]

    Li, T., Zhang, Y., Yu, P., Saha, S., Khashabi, D., Weston, J., Lanchantin, J., and Wang, T. (2025). Jointly reinforcing diversity and quality in language model generations. arXiv preprint arXiv:2509.02534

  54. [54]

    Liang, Z., Lu, S., Yu, W., Panaganti, K., Zhou, Y., Mi, H., and Yu, D. (2025). Can LLMs guide their own exploration? gradient-guided reinforcement learning for LLM reasoning. arXiv preprint arXiv:2512.15687

  55. [55]

    Liu, Z., Chen, C., Li, W., Qi, P., Pang, T., Du, C., Lee, W. S., and Lin, M. (2025). Understanding r1-zero-like training: A critical perspective. In Conference on Language Modeling (COLM)

  56. [56]

    Lugosi, G. and Mendelson, S. (2021). Robust multivariate mean estimation: The optimality of trimmed mean. The Annals of Statistics, 49(1):393–410

  57. [57]

    Lyle, C., Bellemare, M. G., and Castro, P. S. (2019). A comparative analysis of expected and distributional reinforcement learning. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, volume 33, pages 4504–4511

  58. [58]

    Matsutani, K., Takashiro, S., Minegishi, G., Kojima, T., Iwasawa, Y., and Matsuo, Y. (2026). RL squeezes, SFT expands: A comparative study of reasoning LLMs. In The Fourteenth International Conference on Learning Representations

  59. [59]

    Maurer, A., Parletta, D. A., Paudice, A., and Pontil, M. (2021). Robust unsupervised learning via L-statistic minimization. In International Conference on Machine Learning, pages 7524–7533. PMLR

  60. [60]

    Mavrin, B., Zhang, S., Yao, H., Kong, L., Wu, K., and Yu, Y. (2019). Distributional reinforcement learning for efficient exploration. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4424–4434. PMLR

  61. [61]

    Mnih, A. and Rezende, D. J. (2016). Variational inference for monte carlo objectives. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2188–2196

  62. [62]

    Mohamed, S., Rosca, M., Figurnov, M., and Mnih, A. (2020). Monte carlo gradient estimation in machine learning. Journal of Machine Learning Research, 21(132):1–62

  63. [63]

    Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., and Tanaka, T. (2010a). Nonparametric return distribution approximation for reinforcement learning. In Proceedings of the 27th International Conference on Machine Learning, pages 799–806

  64. [64]

    Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., and Tanaka, T. (2010b). Parametric return density estimation for reinforcement learning. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, pages 368–375

  65. [65]

    Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., Jiang, X., Cobbe, K., Eloundou, T., Krueger, G., Button, K., Knight, M., Chess, B., and Schulman, J. (2021). WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332

  66. [66]

    Nguyen-Tang, T., Gupta, S., and Venkatesh, S. (2021). Distributional reinforcement learning via moment matching. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, volume 35, pages 9144–9152

  67. [67]

    Nishimori, S., Parmas, P., Koyamada, S., Kozuno, T., Kitamura, T., Ishii, S., and Matsuo, Y. (2026). Emergence of exploration in policy gradient reinforcement learning via retrying. In Proceedings of the International Conference on Machine Learning

  68. [68]

    Ogryczak, W. and Tamir, A. (2003). Minimizing the sum of the k largest functions in linear time. Information Processing Letters, 85(3):117–122

  69. [69]

    O’Neill, B. (2025). The distribution of order statistics under sampling without replacement. Journal of Statistical Theory and Applications, 24:663–698

  70. [70]

    OpenAI, Akkaya, I., Andrychowicz, M., Chociej, M., Litwin, M., McGrew, B., Petron, A., Paino, A., Plappert, M., Powell, G., Ribas, R., Schneider, J., Tezak, N., Tworek, J., Welinder, P., Weng, L., Yuan, Q., Zaremba, W., and Zhang, L. (2019). Solving Rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113

  71. [71]

    OpenAI, Andrychowicz, M., Baker, B., Chociej, M., Józefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., Schneider, J., Sidor, S., Tobin, J., Welinder, P., Weng, L., and Zaremba, W. (2020). Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1):3–20

  72. [72]

    Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744

  73. [73]

    Parmas, P., Rasmussen, C. E., Peters, J., and Doya, K. (2018). PIPPS: Flexible model-based policy search robust to the curse of chaos. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4065–4074

  74. [74]

    Parmas, P. and Seno, T. (2022). Proppo: A message passing framework for customizable and composable learning algorithms. Advances in Neural Information Processing Systems, 35:29152–29165

  75. [75]

    Parmas, P. and Sugiyama, M. (2021). A unified view of likelihood ratio and reparameterization gradients. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 4078–4086

  76. [76]

    Peters, J. and Schaal, S. (2006). Policy gradient methods for robotics. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems

  77. [77]

    Peters, J. and Schaal, S. (2008). Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697

  78. [78]

    Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1278–1286

  79. [79]

    Rockafellar, R. T. and Uryasev, S. (2000). Optimization of conditional value-at-risk. Journal of Risk, 2:21–42

  80. [80]

    Rockafellar, R. T. and Uryasev, S. (2002). Conditional value-at-risk for general loss distributions. Journal of Banking & Finance, 26(7):1443–1471

Showing first 80 references.