Recognition: no theorem link
H\"older Policy Optimisation
Pith reviewed 2026-05-13 05:58 UTC · model grok-4.3
The pith
HölderPO resolves GRPO's aggregation trade-off by using a tunable Hölder mean with annealed parameter p to control gradient concentration and variance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that mapping trajectory-level advantages to policy updates in GRPO is constrained by fixed aggregation mechanisms; the Hölder mean unifies these mechanisms and exposes explicit control: a larger p concentrates the gradient to amplify sparse learning signals, while a smaller p strictly bounds gradient variance. Since no static p resolves this trade-off universally, dynamically annealing p across the training lifecycle yields superior stability and performance over baselines.
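For reference, the Hölder (power) mean carrying this claim has the standard form below; writing it out is an editorial addition, and the identification of token-level probabilities as the aggregated quantities follows the abstract rather than any equation quoted from the paper.

M_p(x_1,\dots,x_T) \;=\; \Big(\tfrac{1}{T}\sum_{t=1}^{T} x_t^{\,p}\Big)^{1/p} \quad (p \neq 0), \qquad M_0(x_1,\dots,x_T) \;=\; \Big(\prod_{t=1}^{T} x_t\Big)^{1/T}.

The limiting cases explain the claimed trade-off: as p → +∞ the mean approaches max_t x_t (weight concentrates on the dominant term), p = 1 gives the arithmetic mean, p → 0 gives the geometric mean, and p → −∞ approaches min_t x_t.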
What carries the argument
The Hölder mean applied to token-level probability aggregation within each trajectory, where the parameter p determines the degree of gradient concentration versus variance bounding.
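A minimal sketch of the aggregation itself, assuming the aggregated quantities are the per-token probabilities of one sampled trajectory (the paper's exact objective may attach the mean to probability ratios or log-probabilities instead):

import numpy as np

def holder_mean(x: np.ndarray, p: float, eps: float = 1e-8) -> float:
    """Hölder (power) mean of positive values x with exponent p.

    p -> +inf approaches max(x), p = 1 is the arithmetic mean,
    p -> 0 is the geometric mean, and p -> -inf approaches min(x).
    """
    x = np.clip(x, eps, None)
    if abs(p) < 1e-6:  # geometric-mean limit at p = 0
        return float(np.exp(np.mean(np.log(x))))
    return float(np.mean(x ** p) ** (1.0 / p))

# Illustrative token probabilities for one trajectory; the resulting scalar
# would weight that trajectory's group-relative advantage in the update.
token_probs = np.array([0.91, 0.40, 0.77, 0.05, 0.88])
for p in (-2.0, 0.0, 1.0, 4.0):
    print(f"p={p:+.1f}  M_p={holder_mean(token_probs, p):.4f}")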
Load-bearing premise
That modulating the single scalar p via annealing will reliably resolve the concentration-stability trade-off across different model sizes, tasks, and sampling budgets without introducing new failure modes.
What would settle it
A direct comparison of the annealed schedule against the best fixed p value, measured by whether the fixed version matches or exceeds the reported 54.9 percent average accuracy on the mathematical benchmarks.
Original abstract
Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level probabilities within each sequence. Relying on a fixed aggregation mechanism for this step fundamentally limits the algorithm's adaptability. Empirically, we observe a critical trade-off: certain fixed aggregations frequently suffer from training collapse, while others fail to yield satisfactory performance. To resolve this, we propose HölderPO, a generalised policy optimisation framework unifying token-level probability aggregation via the Hölder mean. By explicitly modulating the parameter p, our framework provides continuous control over the trade-off between gradient concentration and variance bounds. Theoretically, we prove that a larger p concentrates the gradient to amplify sparse learning signals, whereas a smaller p strictly bounds gradient variance. Because no static configuration can universally resolve this concentration-stability trade-off, we instantiate the framework with a dynamic annealing algorithm that progressively schedules p across the training lifecycle. Extensive evaluations demonstrate superior stability and convergence over existing baselines. Specifically, our approach achieves a state-of-the-art average accuracy of 54.9% across multiple mathematical benchmarks, yielding a substantial 7.2% relative gain over standard GRPO and secures an exceptional 93.8% success rate on ALFWorld.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes HölderPO as a generalization of Group Relative Policy Optimisation (GRPO) for LLM policy optimization. It replaces fixed aggregation of token-level probabilities with the Hölder mean, controlled by a scalar parameter p that trades off gradient concentration (large p) against variance bounds (small p). Theoretical claims establish the limiting behaviors of p, and a dynamic annealing schedule is introduced because no fixed p works universally. Experiments report 54.9% average accuracy on mathematical benchmarks (+7.2% relative to GRPO) and 93.8% success on ALFWorld.
Significance. If the gradient-concentration and variance-bound results are rigorously derived and the empirical gains survive proper ablations and statistical controls, the framework would supply a continuous, theoretically grounded knob for the stability-performance trade-off that appears in many group-based RL methods for LLMs. The explicit unification via Hölder means and the annealing instantiation are the primary contributions.
major comments (3)
- [§3] §3 (theoretical analysis): the claim that larger p concentrates the gradient to amplify sparse signals and that smaller p strictly bounds variance is asserted without the key intermediate inequalities or proof steps; the load-bearing step is the mapping from the Hölder mean definition to the advantage-weighted policy gradient, which must be shown explicitly under the GRPO group-relative advantage estimator.
- [Experimental section] Experimental section, results table for mathematical benchmarks: the reported 54.9% average and 7.2% relative gain lack error bars, number of random seeds, and any ablation that isolates the annealing schedule from fixed-p Hölder means or from extra hyper-parameter search; without these controls the claim that dynamic p “resolves” the trade-off cannot be distinguished from post-hoc fitting on the same evaluation suites.
- [§4.2] §4.2 (annealing algorithm): the specific functional form and timing of the p schedule constitute an additional free parameter whose selection procedure is not described; if this schedule was tuned on the math and ALFWorld data used for final reporting, the performance numbers are at risk of circularity and the universality claim is under-supported.
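One way to make the load-bearing step requested in the first major comment concrete, offered here purely as an illustrative reading and not as the paper's derivation, is a surrogate in which the sequence-level weight is the Hölder mean of token-level quantities x_{i,t} (probabilities or importance ratios) multiplied by the GRPO group-relative advantage:

J(\theta) \;\approx\; \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G} M_p\big(x_{i,1},\dots,x_{i,T_i}\big)\,\hat{A}_i\right], \qquad \hat{A}_i \;=\; \frac{r_i - \operatorname{mean}(r_{1:G})}{\operatorname{std}(r_{1:G})},

where G is the group size and r_i the trajectory reward. The referee's request amounts to expanding \nabla_\theta M_p explicitly under this (or the paper's actual) objective, so that the claimed concentration of gradient mass for large p and the variance bound for small p follow from stated inequalities rather than assertion.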
minor comments (2)
- [§2] The Hölder-mean formula itself should be written explicitly once (e.g., as Eq. (3)) rather than left implicit, to aid readers unfamiliar with the family.
- [Figures] Figure captions for training curves should state the exact p-annealing schedule used in each run.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed report. We address each major comment below and will incorporate revisions to improve the rigor and clarity of the manuscript.
Point-by-point responses
-
Referee: [§3] §3 (theoretical analysis): the claim that larger p concentrates the gradient to amplify sparse signals and that smaller p strictly bounds variance is asserted without the key intermediate inequalities or proof steps; the load-bearing step is the mapping from the Hölder mean definition to the advantage-weighted policy gradient, which must be shown explicitly under the GRPO group-relative advantage estimator.
Authors: We agree that the theoretical claims in §3 require more explicit intermediate steps. In the revised manuscript we will expand the section to include the full derivation: starting from the definition of the Hölder mean, we will show the explicit mapping to the advantage-weighted policy gradient under the GRPO estimator, followed by the inequalities establishing gradient concentration for large p and strict variance bounds for small p. revision: yes
-
Referee: Experimental section, results table for mathematical benchmarks: the reported 54.9% average and 7.2% relative gain lack error bars, number of random seeds, and any ablation that isolates the annealing schedule from fixed-p Hölder means or from extra hyper-parameter search; without these controls the claim that dynamic p “resolves” the trade-off cannot be distinguished from post-hoc fitting on the same evaluation suites.
Authors: We will revise the experimental section to report error bars computed across multiple random seeds, state the number of seeds employed, and add ablation studies that compare the dynamic annealing schedule against fixed-p Hölder means while controlling for hyper-parameter search effort. These additions will allow readers to isolate the contribution of the annealing procedure. revision: yes
-
Referee: [§4.2] §4.2 (annealing algorithm): the specific functional form and timing of the p schedule constitute an additional free parameter whose selection procedure is not described; if this schedule was tuned on the math and ALFWorld data used for final reporting, the performance numbers are at risk of circularity and the universality claim is under-supported.
Authors: We will expand §4.2 to give the exact functional form of the p schedule, the timing parameters, and the procedure used to select them. The schedule was chosen via preliminary runs on a held-out validation split distinct from the final reported benchmarks; we will state this explicitly and list the concrete hyper-parameters to remove any ambiguity about circularity. revision: yes
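To illustrate the kind of disclosure being promised, a minimal annealing sketch in Python is given below; the linear form, endpoints, and direction of the schedule are hypothetical placeholders, not values taken from the paper.

def annealed_p(step: int, total_steps: int,
               p_start: float = 2.0, p_end: float = 0.5) -> float:
    """Hypothetical linear annealing of the Hölder exponent p.

    Starts at p_start (stronger gradient concentration early in training)
    and moves toward p_end (tighter variance bound late in training).
    The schedule actually used by HölderPO is not reproduced here.
    """
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return p_start + (p_end - p_start) * frac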
Circularity Check
No significant circularity detected in the HölderPO derivation
Full rationale
The paper's chain begins with an empirical observation of trade-offs in fixed aggregations for GRPO, introduces the Hölder mean (a pre-existing mathematical object) to unify them, derives theoretical bounds on gradient concentration for large p and variance for small p, and motivates dynamic annealing of p because no fixed value resolves the trade-off universally. None of these steps reduce by construction to the final benchmark numbers; the annealing schedule is an explicit algorithmic choice whose specific form is not shown to be equivalent to fitting on the reported math or ALFWorld results. The performance claims are presented as downstream experimental outcomes rather than tautological consequences of the inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- p
axioms (1)
- standard mathematical properties of the Hölder mean hold for the token-probability vectors arising in LLM policy gradients