Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
Pith reviewed 2026-05-08 13:51 UTC · model grok-4.3
The pith
Group-based RL for LLMs implicitly projects policies toward targets on the response simplex; LPO makes the projection explicit and exact.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Group-based methods share a common structure: each implicitly defines a target on the response simplex and approximates its projection. LPO makes this explicit by restricting the proximal RL objective to the simplex and projecting exactly via divergence minimization, which supplies monotonic listwise gains, bounded self-correcting gradients, and freedom in the choice of divergence.
What carries the argument
Target-projection on the LLM response simplex, achieved by restricting the proximal RL objective to the simplex and performing exact divergence minimization in a decoupled step.
Load-bearing premise
That the implicit targets of existing group-based methods can be recovered or improved by restricting the proximal RL objective to the response simplex and minimizing a divergence exactly.
What would settle it
A controlled experiment on a reasoning benchmark where LPO, using the same implicit target as a standard group-relative baseline, shows no monotonic improvement or loses stability and diversity.
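The target-projection premise can be sketched numerically. Under a common proximal-RL closed form (an assumption for illustration, not taken verbatim from the paper), the target reweights the current policy's probabilities over a sampled group by exponentiated group-relative advantages, and the projection gradient with respect to the logits is zero-sum and bounded:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def implicit_target(p, rewards, beta=1.0):
    """Hypothetical target on the empirical response simplex: current
    group probabilities reweighted by exponentiated group-relative
    advantages, then renormalized (an assumed closed form)."""
    adv = rewards - rewards.mean()      # group-relative advantage
    q = p * np.exp(adv / beta)
    return q / q.sum()

# Toy group of 4 sampled responses with verifiable 0/1 rewards.
logits = np.array([0.2, -0.1, 0.5, 0.0])
p = softmax(logits)
rewards = np.array([1.0, 0.0, 1.0, 0.0])
q = implicit_target(p, rewards)

# Gradient of KL(q || p_theta) w.r.t. the logits is p - q:
# each component lies in [-1, 1] and the components sum to zero.
grad = p - q
print(np.isclose(grad.sum(), 0.0))  # → True
```

The zero-sum property here is purely structural: both `p` and `q` live on the same empirical simplex, so their difference always sums to zero regardless of the rewards.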
read the original abstract
Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for large language models (LLMs) post-training to incentivize reasoning capacity. Among existing recipes, group-based policy gradient is prevalent, which samples a group of responses per prompt and updates the policy via group-relative advantage signals. This work reveals that these optimization strategies share a common geometric structure: each implicitly defines a target distribution on the response simplex and projects toward it via first-order approximation. Building on this insight, we propose Listwise Policy Optimization (LPO) to explicitly conduct the target-projection, which demystifies the implicit target by restricting the proximal RL objective to the response simplex, and then projects the policy via exact divergence minimization. This framework provides (i) monotonic improvement on the listwise objective with bounded, zero-sum, and self-correcting projection gradients, and (ii) flexibility in divergence selection with distinct structural properties through the decoupled projection step. On diverse reasoning tasks and LLM backbones, LPO consistently improves training performance over typical policy gradient baselines under matched targets, while intrinsically preserving optimization stability and response diversity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that group-based policy gradient methods in RLVR implicitly define target distributions on the response simplex and perform first-order approximations to project toward them. It proposes Listwise Policy Optimization (LPO), which explicitly restricts the proximal RL objective to the response simplex and conducts exact divergence minimization for the projection step. This is asserted to deliver monotonic improvement on the listwise objective with bounded, zero-sum, and self-correcting gradients, plus flexibility in divergence selection, while yielding better empirical results than standard policy gradient baselines on reasoning tasks with preserved stability and diversity.
Significance. If the geometric unification and exact-projection guarantees hold, the work offers a principled framework that demystifies implicit targets in group-based RLVR and enables divergence-flexible optimization with theoretical stability properties. This could improve post-training for LLM reasoning by providing monotonicity assurances absent in many current methods, and the empirical gains under matched targets suggest practical utility. The explicit target-projection view is a notable conceptual contribution if the implementation details support the claims.
major comments (2)
- [§3 (target-projection construction)] The central claim of monotonic improvement via exact divergence minimization (abstract and §3) rests on performing an exact argmin over the response simplex. For autoregressive LLMs the policy is token-level, so the induced full-response distribution is not directly parameterized; any practical implementation must rely on sampling or variational approximations. It is not shown that these preserve the exact projection, the zero-sum property, or the self-correcting gradient behavior required for the monotonicity guarantee.
- [§3.3 (gradient properties)] The bounded, zero-sum, and self-correcting properties of the projection gradients are asserted to follow from the decoupled projection step, but the derivation (likely §3.3 or §4) does not address how these properties survive the necessary approximations for LLM-scale response spaces. Without this, the geometric equivalence to group-based methods and the claimed superiority over first-order policy gradients remain unverified.
minor comments (2)
- [§2] Notation for the response simplex and the listwise objective could be introduced earlier with an explicit definition of the target distribution to aid readability.
- [§5] The experiments section would benefit from an ablation isolating the effect of the exact vs. approximate projection on the reported stability metrics.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments, which help clarify the distinction between the theoretical framework and its practical realization for autoregressive LLMs. We address each major comment point by point below, proposing targeted revisions to strengthen the manuscript.
read point-by-point responses
- Referee: [§3 (target-projection construction)] The central claim of monotonic improvement via exact divergence minimization (abstract and §3) rests on performing an exact argmin over the response simplex. For autoregressive LLMs the policy is token-level, so the induced full-response distribution is not directly parameterized; any practical implementation must rely on sampling or variational approximations. It is not shown that these preserve the exact projection, the zero-sum property, or the self-correcting gradient behavior required for the monotonicity guarantee.
Authors: We appreciate this observation on the gap between the idealized projection and LLM-scale implementation. Section 3 develops the exact target-projection operator over the response simplex to establish monotonicity and gradient properties in the continuous setting. In practice, LPO (like the group-based baselines) approximates the simplex via a finite set of sampled responses per prompt, performing the divergence minimization over this empirical support. This Monte Carlo approximation is explicitly used in the algorithm and experiments. The zero-sum property is preserved exactly within each sampled group by construction of the normalized target, while boundedness and self-correction hold approximately with variance controlled by group size. We will revise §3 to include a new paragraph and appendix derivation explicitly stating the sampling approximation, its effect on the guarantees (monotonicity in expectation), and why the geometric unification with group-based RLVR remains valid. revision: partial
- Referee: [§3.3 (gradient properties)] The bounded, zero-sum, and self-correcting properties of the projection gradients are asserted to follow from the decoupled projection step, but the derivation (likely §3.3 or §4) does not address how these properties survive the necessary approximations for LLM-scale response spaces. Without this, the geometric equivalence to group-based methods and the claimed superiority over first-order policy gradients remain unverified.
Authors: We agree that the survival of these properties under approximation requires explicit treatment. The decoupled projection in LPO separates the target computation from the policy parameters, so that even with sampled responses the resulting gradient remains a normalized, zero-sum vector whose magnitude is bounded by the chosen divergence (e.g., KL or reverse KL). We will expand §3.3 with a short derivation showing that the self-correcting behavior (negative feedback when the policy overshoots the target) holds in expectation over the sampling distribution, and add empirical plots of gradient norms across training to verify stability. This addition will also reinforce the equivalence to implicit targets in group-based methods and the observed empirical gains under matched targets. revision: partial
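The rebuttal's stability claim can be illustrated on a fixed empirical simplex: because the decoupled objective KL(q || p_theta) is convex and smooth in the logits, plain gradient steps decrease the divergence to the target monotonically. The sketch below uses an arbitrary 6-point target standing in for a sampled group; it is a toy convexity demonstration, not the paper's training loop.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(q, p):
    """KL(q || p) on a finite simplex."""
    return float(np.sum(q * np.log(q / p)))

# Hypothetical fixed target on a 6-point empirical simplex
# (standing in for a sampled group of 6 responses).
q = softmax(rng.normal(size=6))
logits = rng.normal(size=6)

kls = [kl(q, softmax(logits))]
for _ in range(200):
    p = softmax(logits)
    logits -= 0.5 * (p - q)          # grad of KL(q || p_theta) w.r.t. logits
    kls.append(kl(q, softmax(logits)))

# Convexity plus smoothness of KL(q || softmax(logits)) in the logits
# guarantees every step decreases the divergence.
print(all(b <= a + 1e-12 for a, b in zip(kls, kls[1:])))  # → True
```

The gradient `p - q` is the same bounded, zero-sum vector discussed in the referee exchange; the monotone decrease is the deterministic counterpart of the "monotonicity in expectation" the authors promise under sampling.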
Circularity Check
No circularity: explicit target-projection construction is independent of inputs
full rationale
The paper reinterprets group-based RLVR methods as implicitly defining targets on the response simplex and performing first-order projections, then introduces LPO as an explicit version by restricting the proximal objective to the simplex and performing exact divergence minimization. This geometric reframing and the resulting claims of monotonic improvement, bounded zero-sum gradients, and divergence flexibility are derived from the new decoupled projection step rather than reducing to fitted parameters, self-definitions, or author self-citations. The abstract and claimed derivation chain introduce independent structure (listwise objective, exact argmin) without tautological equivalence to the original group-relative advantages. No load-bearing steps match the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The space of LLM responses for a given prompt forms a probability simplex on which a target distribution can be defined and onto which the policy can be projected.
- domain assumption First-order policy gradient updates are equivalent to a proximal projection step toward an implicit target.
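The second assumption can be written out. Restricting a KL-regularized proximal step to the response simplex gives a closed-form target, and the exact projection gradient reduces to an exponentially reweighted score function. This is a standard derivation under the stated assumptions; the symbols $A$ (advantage), $\beta$ (proximal temperature), and $Z$ (normalizer) are illustrative and may differ from the paper's notation.

```latex
\pi_{t+1} \;=\; \arg\max_{q \in \Delta}\; \mathbb{E}_{y \sim q}\big[A(y)\big]
  \;-\; \beta\,\mathrm{KL}\big(q \,\|\, \pi_t\big)
\quad\Longrightarrow\quad
q^{*}(y) \;\propto\; \pi_t(y)\,\exp\!\big(A(y)/\beta\big),

\nabla_{\theta}\,\mathrm{KL}\big(q^{*} \,\|\, \pi_{\theta}\big)
\;=\; -\,\mathbb{E}_{y \sim q^{*}}\big[\nabla_{\theta}\log \pi_{\theta}(y)\big]
\;=\; -\,\tfrac{1}{Z}\,\mathbb{E}_{y \sim \pi_t}\Big[e^{A(y)/\beta}\,
      \nabla_{\theta}\log \pi_{\theta}(y)\Big].
```

To first order in $A/\beta$, the reweighting $e^{A(y)/\beta} \approx 1 + A(y)/\beta$ recovers the familiar advantage-weighted policy gradient, which is the precise sense in which first-order updates approximate the projection.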