pith. machine review for the scientific record.

arxiv: 2604.28005 · v1 · submitted 2026-04-30 · 💻 cs.LG · stat.ML

Recognition: unknown

Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning

Authors on Pith · no claims yet

Pith reviewed 2026-05-07 06:54 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords kernel smoothing · advantage estimation · LLM reasoning · reinforcement learning · policy optimization · value function estimation · nonparametric statistics · GRPO

The pith

Applying kernel smoothing to a small number of reasoning traces yields accurate value and gradient estimates that improve policy optimization in LLM reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the challenge of efficient reinforcement learning for improving LLM reasoning when computational resources limit the number of reasoning traces that can be sampled per prompt. Standard methods either train an expensive value network, sample many traces for averaging, or rely on a single trace, which leads to high-variance gradients. By applying kernel smoothing to estimate the value function from few samples, the approach achieves accurate estimates in both theory and practice, yielding lower-variance policy gradients and better optimization outcomes than the baselines. A sympathetic reader would care because it offers a way to scale RL training for LLMs without proportional increases in compute or memory.

Core claim

The paper claims that kernel smoothing, drawn from nonparametric statistics, can be applied directly to a small number of sampled reasoning traces per prompt to estimate the value function in the high-dimensional discrete output space of LLMs. This produces accurate value estimates and low-variance policy gradients, which in turn support improved policy optimization, as demonstrated by both theoretical analysis and numerical experiments.
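
To make the mechanism concrete, here is a minimal sketch of what a kernel-smoothed baseline over a small group of traces could look like, assuming traces are mapped to continuous embeddings and an exponential kernel is used; the paper's exact kernel, distance metric, and smoothing variable are not reproduced on this page and may differ.

```python
import numpy as np

def kernel_smoothed_advantages(embeddings: np.ndarray, rewards: np.ndarray, h: float = 1.0) -> np.ndarray:
    """Kernel-smoothed baseline and advantages for a small group of reasoning traces.

    embeddings: (n, d) array, one continuous embedding per sampled trace
                (how traces are embedded is assumed here, not stated on this page)
    rewards:    (n,) array of scalar rewards, e.g. verifier scores
    h:          kernel bandwidth, the key smoothing parameter

    Each trace's baseline is a Nadaraya-Watson weighted average of the other
    traces' rewards, with weights decaying in embedding distance, instead of
    GRPO's unweighted group mean. A sketch of the general idea, not the
    paper's exact estimator.
    """
    dists = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    weights = np.exp(-dists / h)       # exponential kernel; Figure 3 also reports a triangular kernel
    np.fill_diagonal(weights, 0.0)     # leave-one-out: a trace does not contribute to its own baseline
    baselines = weights @ rewards / np.maximum(weights.sum(axis=1), 1e-8)
    return rewards - baselines         # advantages A_i = Z_i - V_hat_i
```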

What carries the argument

Kernel smoothing applied to reasoning traces for nonparametric value function estimation in the discrete space of LLM outputs.

Load-bearing premise

Kernel smoothing applied to a small number of reasoning traces per prompt can produce sufficiently unbiased and low-variance estimates of the true value function in the high-dimensional, discrete space of LLM outputs.

What would settle it

The central claim would be falsified by experiments on LLM reasoning benchmarks in which the kernel-based method, restricted to the same small number of traces per prompt, shows no reduction in gradient variance or no gain in final policy performance relative to GRPO or REINFORCE.

Figures

Figures reproduced from arXiv: 2604.28005 by Chengchun Shi, Hongyi Zhou, Jin Zhu, Kai Ye, Shijin Gong, Xinyu Zhang.

Figure 1
Figure 1: Expected rewards of one-shot GRPO (Wang et al., 2025b), the oracle algorithm, and our method (denoted KAE) on training (left) and testing (right) datasets in the one-shot regime, where the training data consist of a single observation. One-shot GRPO applies the standard GRPO algorithm directly to this regime. Shaded areas represent confidence intervals. view at source ↗
Figure 2
Figure 2: Illustrations of a generic algorithm that unifies A2C, REINFORCE-, and GRPO-type algorithms. The first approach is A2C, which introduces a critic function C(X) to serve as a baseline and replaces the reward Z with an advantage function A = Z − C(X) in constructing the policy gradient estimator ĝ(θ). Its main idea is that ∇θ log πθ(Y|X) is a score function, and thus multiplying it by any C(X) yields a … view at source ↗
Figure 3
Figure 3: MSE of KAE's value estimator on the MATH dataset across three training steps under varying kernel bandwidths. The left and right panels visualize the MSEs under the triangular and exponential kernels, respectively. Horizontal lines denote the MSEs of REINFORCE++ and GRPO, which are independent of bandwidth and kernel function. view at source ↗
Figure 4
Figure 4: Test accuracy of models post-trained with standard REINFORCE (blue), KAE (red), and a REINFORCE variant using the proposed prompt sampling scheme, on GSM8K (left) and MATH (right) across different training steps. Shaded areas represent the standard error of the accuracy curves, aggregated over five training replications. view at source ↗
read the original abstract

Recent advances in large language models (LLMs) have increasingly relied on reinforcement learning (RL) to improve their reasoning capabilities. Three approaches have been widely adopted: (i) Proximal policy optimization and advantage actor-critic rely on a deep neural network to estimate the value function of the learning policy in order to reduce the variance of the policy gradient. However, estimating and maintaining such a value network incurs substantial computational and memory overhead. (ii) Group relative policy optimization (GRPO) avoids training a value network by approximating the value function using sample averages. However, GRPO samples a large number of reasoning traces per prompt to achieve accurate value function approximation, making it computationally expensive. (iii) REINFORCE-type algorithms sample only a single reasoning trajectory per prompt, which reduces computational cost but suffers from poor sample efficiency. In this work, we focus on a practical, resource-constrained setting in which only a small number of reasoning traces can be sampled per prompt, while low-variance gradient estimation remains essential for high-quality policy learning. To address this challenge, we bring classical nonparametric statistical methods, which are both computationally and statistically efficient, to LLM reasoning. We employ kernel smoothing as a concrete example for value function estimation and the subsequent policy optimization. Numerical and theoretical results demonstrate that our proposal achieves accurate value and gradient estimation, leading to improved policy optimization.
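
Read alongside the Figure 2 caption, the three adopted approaches differ only in the baseline C(X) plugged into one generic policy-gradient estimator; the last case below is a hedged sketch of where a kernel-smoothed baseline would sit, not the paper's stated formula.

```latex
% Generic estimator from the Figure 2 caption, with the baseline choices named in the abstract.
% The kernel-smoothed row is a sketch of the proposal, not the paper's exact formula.
\[
  \hat{g}(\theta) \;=\; \big(Z - C(X)\big)\,\nabla_\theta \log \pi_\theta(Y \mid X),
\]
\[
  C(X) \;=\;
  \begin{cases}
    V_\phi(X) & \text{PPO / A2C: learned critic network}\\[2pt]
    \tfrac{1}{G}\sum_{j=1}^{G} Z_j & \text{GRPO: average over a group of } G \text{ sampled traces}\\[2pt]
    0 & \text{REINFORCE: single trace, no learned baseline}\\[2pt]
    \dfrac{\sum_{j} K_h\big(d(Y, Y_j)\big)\, Z_j}{\sum_{j} K_h\big(d(Y, Y_j)\big)} & \text{kernel smoothing over a few traces (sketch of this paper's proposal)}
  \end{cases}
\]
```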

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes kernelized advantage estimation, applying nonparametric kernel smoothing to value function estimation in reinforcement learning for LLM reasoning. In a resource-constrained setting with only a small number of reasoning traces per prompt, the method aims to achieve low-variance gradient estimates without training a value network (as in PPO) or requiring large per-prompt sample groups (as in GRPO), while avoiding the high variance of single-trajectory REINFORCE. The abstract states that numerical and theoretical results support accurate estimation and improved policy optimization.

Significance. If the kernel smoothing produces sufficiently unbiased and low-variance value estimates in the discrete space of LLM outputs, the approach could meaningfully reduce computational overhead in RL fine-tuning of LLMs while maintaining or improving sample efficiency and policy quality. It would represent a practical bridge between classical nonparametric statistics and modern LLM training pipelines, potentially enabling higher-quality reasoning improvements under tight sampling budgets.

major comments (3)
  1. [Abstract] The central claim that 'Numerical and theoretical results demonstrate that our proposal achieves accurate value and gradient estimation' is unsupported by any quantitative metrics, baseline comparisons, variance reduction factors, error bars, or specific convergence rates. The full manuscript must supply these to substantiate the improvement over GRPO and REINFORCE.
  2. [Theoretical Analysis] The theoretical analysis must derive convergence rates that explicitly account for the effective dimension of the combinatorially large, discrete, variable-length space of LLM reasoning traces and the dependence structure among tokens. Standard nonparametric rates for continuous domains do not automatically transfer; without such rates the claim that kernel smoothing yields accurate estimates from few traces remains unproven.
  3. [Experiments] Numerical experiments must demonstrate that the chosen kernel and metric induce meaningful smoothing over semantically similar trajectories on realistic LLM reasoning traces (rather than toy continuous domains). Direct comparisons to GRPO with matched small sample sizes per prompt are required to show that the estimator does not reduce to the sample-average baseline.
minor comments (2)
  1. [Method] Specify the exact kernel function, bandwidth selection procedure, and distance metric over reasoning traces (including handling of variable lengths) in the method section for reproducibility; an illustrative sketch of such a specification follows these comments.
  2. [Introduction] Add a clear statement of the precise setting (number of traces per prompt, model sizes, benchmarks) in which the method is evaluated.
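
An illustrative sketch of what such a specification could look like, assuming a triangular kernel (one of the two kernels shown in Figure 3), a precomputed distance matrix over traces, and leave-one-out bandwidth selection; none of these choices are confirmed by the page.

```python
import numpy as np

def triangular_kernel(dists: np.ndarray, h: float) -> np.ndarray:
    """Triangular kernel K_h(d) = max(1 - d/h, 0), one of the two kernels shown in Figure 3."""
    return np.maximum(1.0 - dists / h, 0.0)

def select_bandwidth(dists: np.ndarray, rewards: np.ndarray, grid) -> float:
    """Leave-one-out bandwidth selection for a kernel-smoothed baseline.

    dists:   (n, n) pairwise distances between reasoning traces (the metric is assumed, not stated here)
    rewards: (n,) scalar rewards for the traces
    grid:    iterable of candidate bandwidths
    Returns the bandwidth whose leave-one-out reward predictions have the lowest MSE.
    """
    best_h, best_err = None, np.inf
    for h in grid:
        w = triangular_kernel(dists, h)
        np.fill_diagonal(w, 0.0)                     # leave each trace out of its own baseline
        preds = w @ rewards / np.maximum(w.sum(axis=1), 1e-8)
        err = float(np.mean((rewards - preds) ** 2))
        if err < best_err:
            best_h, best_err = h, err
    return best_h
```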

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed report. The comments highlight important areas where the manuscript can be strengthened to better substantiate its claims. We address each major comment below and commit to revisions that will incorporate quantitative support, a more careful discussion of theoretical assumptions, and enhanced experimental validation on realistic LLM traces.

read point-by-point responses
  1. Referee: [Abstract] The central claim that 'Numerical and theoretical results demonstrate that our proposal achieves accurate value and gradient estimation' is unsupported by any quantitative metrics, baseline comparisons, variance reduction factors, error bars, or specific convergence rates. The full manuscript must supply these to substantiate the improvement over GRPO and REINFORCE.

    Authors: We agree that the abstract claim would benefit from more explicit quantitative backing. In the revised manuscript we will add concrete metrics in the main body (and, if space permits, a brief mention in the abstract), including variance reduction factors relative to REINFORCE, error bars from multiple random seeds, and direct numerical comparisons against GRPO and REINFORCE under identical small per-prompt sample budgets. These additions will be supported by the existing numerical results together with new tabulated statistics. revision: yes

  2. Referee: [Theoretical Analysis] The theoretical analysis must derive convergence rates that explicitly account for the effective dimension of the combinatorially large, discrete, variable-length space of LLM reasoning traces and the dependence structure among tokens. Standard nonparametric rates for continuous domains do not automatically transfer; without such rates the claim that kernel smoothing yields accurate estimates from few traces remains unproven.

    Authors: The referee correctly identifies a gap. Our current analysis applies kernel smoothing in a continuous embedding space where standard nonparametric rates hold under standard regularity conditions on the kernel and the embedding metric. We will revise the theoretical section to (i) explicitly discuss the effective dimension induced by the chosen metric, (ii) address token-level dependence through the embedding, and (iii) either derive adapted convergence rates under suitable assumptions on the kernel or clearly delineate the limitations of the continuous-space transfer. If a complete derivation proves intractable within the scope of the paper, we will state this limitation transparently. revision: partial

  3. Referee: [Experiments] Numerical experiments must demonstrate that the chosen kernel and metric induce meaningful smoothing over semantically similar trajectories on realistic LLM reasoning traces (rather than toy continuous domains). Direct comparisons to GRPO with matched small sample sizes per prompt are required to show that the estimator does not reduce to the sample-average baseline.

    Authors: We will substantially expand the experimental section. New results will be presented on realistic LLM reasoning benchmarks using actual model-generated traces. We will include qualitative and quantitative evidence (e.g., similarity heatmaps or weight distributions) showing that the kernel assigns higher weights to semantically related trajectories. In addition, we will report head-to-head comparisons against GRPO using exactly the same small per-prompt sample sizes (e.g., 4–8 traces) to demonstrate that the kernel estimator yields lower variance and better policy performance than the plain sample-average baseline employed by GRPO. revision: yes
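
A quick way to probe the referee's worry that the estimator "reduces to the sample-average baseline" is to check the large-bandwidth limit. The sketch below, under the same embedding and exponential-kernel assumptions as above, shows the kernel baseline collapsing to an unweighted leave-one-out sample average as the bandwidth grows, so any claimed gain has to come from a finite, well-chosen bandwidth.

```python
import numpy as np

def kernel_baseline(dists: np.ndarray, rewards: np.ndarray, h: float) -> np.ndarray:
    """Leave-one-out kernel-smoothed baseline with an exponential kernel (a sketch, not the paper's exact form)."""
    w = np.exp(-dists / h)
    np.fill_diagonal(w, 0.0)
    return w @ rewards / np.maximum(w.sum(axis=1), 1e-8)

# With 6 hypothetical traces and random 8-dimensional embeddings, the large-bandwidth
# limit recovers the plain leave-one-out sample average: in that regime the kernel
# estimator is just the sample-average baseline the referee mentions.
rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 8))
z = rng.normal(size=6)
d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
loo_mean = (z.sum() - z) / (len(z) - 1)
assert np.allclose(kernel_baseline(d, z, h=1e6), loo_mean, atol=1e-3)
```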

Circularity Check

0 steps flagged

No circularity: standard application of nonparametric kernel methods to LLM value estimation

full rationale

The paper presents kernel smoothing as a direct transfer of classical nonparametric statistics to estimate value functions from small numbers of LLM reasoning traces, contrasting it with neural value networks and sample-average GRPO. No equations, fitting procedures, or derivations are shown that define a claimed 'accurate estimation' or 'improved policy optimization' in terms of the same data or parameters used to evaluate it. The abstract and description invoke no self-citations as load-bearing uniqueness theorems, no ansatzes smuggled via prior work, and no renaming of known empirical patterns as new results. Theoretical convergence claims and numerical demonstrations are positioned as external validation rather than tautological reductions. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the transferability of classical kernel smoothing consistency results to the discrete, high-dimensional distribution of LLM reasoning traces; no explicit free parameters, invented entities, or additional axioms are named in the abstract.

axioms (1)
  • domain assumption Kernel smoothing produces consistent estimates of the conditional expectation (value function) when applied to a modest number of samples drawn from the LLM policy's output distribution.
    The abstract invokes nonparametric statistical efficiency without stating conditions under which the LLM trace distribution satisfies the usual smoothness or density assumptions required for kernel consistency.
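
For orientation, these are the classical sufficient conditions under which Nadaraya-Watson smoothing is consistent in a d-dimensional continuous covariate space; whether, and in what effective dimension, they transfer to embedded LLM reasoning traces is precisely what this axiom leaves open.

```latex
% Classical consistency conditions for the Nadaraya-Watson estimator
%   \hat{V}(x) = \sum_i K_h(d(x, x_i)) Z_i / \sum_i K_h(d(x, x_i))
% in a d-dimensional continuous covariate space; stated here only as the benchmark
% the axiom implicitly appeals to, not as a result proved for LLM traces.
\[
  h \to 0, \qquad n h^{d} \to \infty
  \quad\Longrightarrow\quad
  \hat{V}(x) \xrightarrow{\;p\;} V(x)
  \quad\text{(pointwise, for smooth } V \text{ and a design density bounded away from } 0\text{),}
\]
\[
  \mathbb{E}\big[\hat{V}(x) - V(x)\big]^{2}
  \;=\; O\!\big(h^{4} + (n h^{d})^{-1}\big)
  \;=\; O\!\big(n^{-4/(4+d)}\big)
  \quad\text{at } h \asymp n^{-1/(4+d)} \text{ for twice-differentiable } V.
\]
```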

pith-pipeline@v0.9.0 · 5555 in / 1376 out tokens · 62808 ms · 2026-05-07T06:54:54.394441+00:00 · methodology

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning Perturbations to Extrapolate Your LLM

    stat.ML 2026-05 unverdicted novelty 6.0

    A learnable continuous perturbation framework for LLM token prefixes via latent vector transformations, optimized through unbiased estimating equations, yields gains in out-of-domain performance.

  2. Perturbation is All You Need for Extrapolating Language Models

    stat.ML 2026-05 unverdicted novelty 6.0

    Perturbing prefixes to semantic neighbors during training creates a hierarchical noise model that improves language model predictions on token sequences outside the training corpus support.

Reference graph

Works this paper leans on

33 extracted references · 30 canonical work pages · cited by 2 Pith papers · 11 internal anchors

  1. [1]

    Deep transfer Q-learning for offline non-stationary reinforcement learning. arXiv preprint arXiv:2501.04870.

    Jinhang Chai, Elynn Chen, and Jianqing Fan. Deep transfer Q-learning for offline non-stationary reinforcement learning. arXiv preprint arXiv:2501.04870.

  2. [2]

    Privacy-preserving reinforcement learning from human feedback via decoupled reward modeling. arXiv preprint arXiv:2603.22563.

    Young Hyun Cho and Will Wei Sun. Privacy-preserving reinforcement learning from human feedback via decoupled reward modeling. arXiv preprint arXiv:2603.22563.

  3. [3]

    Gpg: A simple and strong reinforcement learning baseline for model reasoning. arXiv preprint arXiv:2504.02546, 2025.

    Xiangxiang Chu, Hailang Huang, Xiao Zhang, Fei Wei, and Yong Wang. Gpg: A simple and strong reinforcement learning baseline for model reasoning. arXiv preprint arXiv:2504.02546, 2025.

  4. [4]

    Cde: Curiosity-driven exploration for efficient reinforcement learning in large language models. arXiv preprint arXiv:2509.09675.

    Runpeng Dai, Linfeng Song, Haolin Liu, Zhenwen Liang, Dian Yu, Haitao Mi, Zhaopeng Tu, Rui Liu, Tong Zheng, Hongtu Zhu, et al. Cde: Curiosity-driven exploration for efficient reinforcement learning in large language models. arXiv preprint arXiv:2509.09675.

  5. [5]

    Statistical reinforcement learning in the real world: A survey of challenges and future directions. arXiv preprint arXiv:2601.15353.

    Asim H Gazi, Yongyi Guo, Daiqi Gao, Ziping Xu, Kelly W Zhang, and Susan A Murphy. Statistical reinforcement learning in the real world: A survey of challenges and future directions. arXiv preprint arXiv:2601.15353.

  6. [6]

    A Review of Causal Decision Making

    Lin Ge, Hengrui Cai, Runzhe Wan, Yang Xu, and Rui Song. A review of causal decision making. arXiv preprint arXiv:2502.16156,

  7. [7]

    Ebpo: Empirical Bayes shrinkage for stabilizing group-relative policy optimization. arXiv preprint arXiv:2602.05165.

    Kevin Han, Yuhang Zhou, Mingze Gao, Gedi Zhou, Serena Li, Abhishek Kumar, Xiangjun Fan, Weiwei Li, and Lizhu Zhang. Ebpo: Empirical Bayes shrinkage for stabilizing group-relative policy optimization. arXiv preprint arXiv:2602.05165.

  8. [8]

    On-policy RL with optimal reward baseline. arXiv preprint arXiv:2505.23585, 2025.

    Yaru Hao, Li Dong, Xun Wu, Shaohan Huang, Zewen Chi, and Furu Wei. On-policy RL with optimal reward baseline. arXiv preprint arXiv:2505.23585, 2025.

  9. [9]

    REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

    Jian Hu, Jason Klein Liu, Haotian Xu, and Wei Shen. Reinforce++: Stabilizing critic-free policy optimization with global advantage normalization. arXiv preprint arXiv:2501.03262.

  10. [10]

    The Implicit Curriculum: Learning Dynamics in RL with Verifiable Rewards

    Yu Huang, Zixin Wen, Yuejie Chi, Yuting Wei, Aarti Singh, Yingbin Liang, and Yuxin Chen. On the learning dynamics of RLVR at the edge of competence. arXiv preprint arXiv:2602.14872.

  11. [11]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720.

  12. [12]

    Buy 4 REINFORCE samples, get a baseline for free! In ICLR 2019 Workshop on Deep Reinforcement Learning Meets Structured Prediction.

    Wouter Kool, Herke van Hoof, and Max Welling. Buy 4 REINFORCE samples, get a baseline for free! In ICLR 2019 Workshop on Deep Reinforcement Learning Meets Structured Prediction.

  13. [13]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124.

  14. [14]

    Low-rank contextual reinforcement learning from heterogeneous human feedback

    Seong Jin Lee, Will Wei Sun, and Yufeng Liu. Low-rank contextual reinforcement learning from heterogeneous human feedback. arXiv preprint arXiv:2412.19436.

  15. [15]

    Repo: Replay-enhanced policy optimization

    Siheng Li, Zhanhui Zhou, Wai Lam, Chao Yang, and Chaochao Lu. Repo: Replay-enhanced policy optimization. arXiv preprint arXiv:2506.09340, 2025a. Yu Li, Tian Lan, and Zhengling Qi. When right meets wrong: Bilateral context conditioning with reward-confidence correction for grpo. arXiv preprint arXiv:2603.13134, 2026b. Yuhan Li, Eugene Han, Yifan Hu, Zhenglin...

  16. [16]

    Cppo: Accelerating the training of group relative policy optimization-based reasoning models. arXiv preprint arXiv:2503.22342.

    Zhihang Lin, Mingbao Lin, Yuan Xie, and Rongrong Ji. Cppo: Accelerating the training of group relative policy optimization-based reasoning models. arXiv preprint arXiv:2503.22342.

  17. [17]

    Statistical Impossibility and Possibility of Aligning LLMs with Human Preferences: From Condorcet Paradox to Nash Equilibrium

    Kaizhao Liu, Qi Long, Zhekun Shi, Weijie J Su, and Jiancong Xiao. Statistical impossibility and possibility of aligning llms with human preferences: From condorcet paradox to nash equilibrium. arXiv preprint arXiv:2503.10990, 2025a. Pangpang Liu, Junwei Lu, and Will Wei Sun. Uncertainty quantification for large language model reward learning under heterog...

  18. [18]

    Online estimation and inference for robust policy evaluation in reinforcement learning. The Annals of Statistics, 53(5):2128–2152, 2025c.

    Weidong Liu, Jiyuan Tu, Xi Chen, and Yichen Zhang. Online estimation and inference for robust policy evaluation in reinforcement learning. The Annals of Statistics, 53(5):2128–2152, 2025c. Zhaowei Liu, Xin Guo, Zhi Yang, Fangqi Lou, Lingfeng Zeng, Jinyi Niu, Mengping Li, Qi Qi, Zhiqiang Liu, Yiyang Han, et al. Fin-r1: A large language model for financial r...

  19. [19]

    Asynchronous methods for deep reinforcement learning

    Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937. PMLR.

  20. [20]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

  21. [21]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

  22. [22]

    Statistically efficient advantage learning for offline reinforcement learning in infinite horizons. Journal of the American Statistical Association, 119(545):232–245, 2024a.

    Chengchun Shi, Shikai Luo, Yuan Le, Hongtu Zhu, and Rui Song. Statistically efficient advantage learning for offline reinforcement learning in infinite horizons. Journal of the American Statistical Association, 119(545):232–245, 2024a. Chengchun Shi, Zhengling Qi, Jianing Wang, and Fan Zhou. Value enhancement of reinforcement learning via efficient and ro...

  23. [23]

    Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning

    Hu Wang, Congbo Ma, Ian Reid, and Mohammad Yaqub. Kalman filter enhanced grpo for reinforcement learning-based language model reasoning. arXiv preprint arXiv:2505.07527, 2025a. Jiayi Wang, Zhengling Qi, and Raymond KW Wong. Projected state-action balancing weights for offline reinforcement learning. The Annals of Statistics, 51(4):1639–1665.

  24. [24]

    Reinforcement learning for reasoning in large language models with one training example, 2025

    Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Liyuan Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, et al. Reinforcement learning for reasoning in large language models with one training example. arXiv preprint arXiv:2504.20571, 2025b. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et ...

  25. [25]

    A statistical framework for alignment with biased AI feedback

    Xintao Xia, Zhiqiu Xia, Linjun Zhang, and Zhanrui Cai. A statistical framework for alignment with biased AI feedback. arXiv preprint arXiv:2602.08259.

  26. [26]

    A minimalist approach to llm reasoning: from rejection sampling to reinforce, 2025

    Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong, et al. A minimalist approach to llm reasoning: from rejection sampling to reinforce. arXiv preprint arXiv:2504.11343.

  27. [27]

    Single-stream policy optimization. arXiv preprint arXiv:2509.13232.

    Zhongwen Xu and Zihan Ding. Single-stream policy optimization. arXiv preprint arXiv:2509.13232.

  28. [28]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

  29. [29]

    Shrinking the variance: Shrinkage baselines for reinforcement learning with verifiable rewards. arXiv preprint arXiv:2511.03710.

    Guanning Zeng, Zhaoyi Zhou, Daman Arora, and Andrea Zanette. Shrinking the variance: Shrinkage baselines for reinforcement learning with verifiable rewards. arXiv preprint arXiv:2511.03710.

  30. [30]

    Geometric-mean policy optimization

    Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, et al. Geometric-mean policy optimization. arXiv preprint arXiv:2507.20673.

  31. [31]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025a. Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, and Dong Yu. Parallel-r1: Towards parallel...

  32. [32]

    arXiv preprint arXiv:2603.01162v3

    Doudou Zhou, Yufeng Zhang, Aaron Sonabend-W, Zhaoran Wang, Junwei Lu, and Tianxi Cai. Federated offline reinforcement learning. Journal of the American Statistical Association, 119(548):3152–3163, 2024a. Hongyi Zhou, Kai Ye, Erhan Xu, Jin Zhu, Ying Yang, Shijin Gong, and Chengchun Shi. Demystifying group relative policy optimization: Its policy gradient...

  33. [33]

    Align: Aligned delegation with performance guarantees for multi-agent llm reasoning

    Wenzhuo Zhou, Ruoqing Zhu, and Annie Qu. Estimating optimal infinite horizon dynamic treatment regimes via pt-learning. Journal of the American Statistical Association, 119(545):625–638, 2024b. Tong Zhu, Baiting Chen, Jin Zhou, Hua Zhou, Sriram Sankararaman, and Xiaowu Dai. Align: Aligned delegation with performance guarantees for multi-agent llm reasoni...