Demystifying Group Relative Policy Optimization: Its Policy Gradient is a U-Statistic (2026)
7 Pith papers cite this work.
[Citation timeline: 2026, 7 representative citing papers]
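For orientation on the titular claim: GRPO scores each prompt with a group of sampled responses and centers rewards within the group. Below is a minimal worked rewriting of that gradient in pairwise form; the notation is the standard GRPO setup (per-group std-normalization omitted), and the pairwise reading is offered as a plausible gloss on the title, not a reproduction of the paper's derivation.

```latex
% G responses o_1,...,o_G sampled for a prompt q, scalar rewards r_1,...,r_G,
% and score functions s_i = \nabla_\theta \log \pi_\theta(o_i \mid q).
\[
  A_i = r_i - \bar r, \qquad \bar r = \frac{1}{G}\sum_{j=1}^{G} r_j, \qquad
  g = \frac{1}{G}\sum_{i=1}^{G} A_i\, s_i .
\]
% Since r_i - \bar r = \frac{1}{G}\sum_j (r_i - r_j), the same gradient is an average
% over pairs of responses:
\[
  g = \frac{1}{G^{2}} \sum_{i=1}^{G}\sum_{j=1}^{G} (r_i - r_j)\, s_i
    = \frac{1}{G^{2}} \sum_{i<j} (r_i - r_j)\,(s_i - s_j),
\]
% which, up to a factor (G-1)/G, is an order-two U-statistic with symmetric kernel
% h(z_i, z_j) = \tfrac12 (r_i - r_j)(s_i - s_j), where z_i = (o_i, r_i).
```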
Representative citing papers:
- Learning Perturbations to Extrapolate Your LLM
  A learnable continuous perturbation framework for LLM token prefixes via latent vector transformations, optimized through unbiased estimating equations, yields gains in out-of-domain performance. (A hedged prefix-perturbation sketch covering this paper and the next follows the list.)
- Perturbation is All You Need for Extrapolating Language Models
  Perturbing prefixes toward semantic neighbors during training creates a hierarchical noise model that improves language model predictions on token sequences outside the support of the training corpus.
- DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment
  DGPO is a critic-free RL framework that uses a bounded Hellinger distance and entropy-gated advantage redistribution for fine-grained token-level credit assignment in long chain-of-thought generations for LLM alignment, reporting state-of-the-art results on AIME benchmarks. (A Hellinger-distance sketch follows the list.)
- Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning
  Kernel smoothing yields accurate value and gradient estimates for low-variance policy learning in LLM reasoning under tight per-prompt sampling budgets. (A kernel-smoothing sketch follows the list.)
- Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks
  RTT bridges response-level rubrics to token-level rewards via a relevance discriminator and intra-sample group normalization, yielding higher instruction and rubric accuracy than baselines. (An intra-sample normalization sketch follows the list.)
- PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents
  PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load-balancing, recommendation, and protein tasks.
- Reinforcement Learning from Human Feedback: A Statistical Perspective
  A statistical survey of RLHF for LLM alignment that connects preference learning and policy optimization to models such as Bradley-Terry-Luce while reviewing methods, extensions, and open challenges. (The Bradley-Terry-Luce model is restated after the list for reference.)
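The sketches below expand on mechanisms named in the list; every concrete function name, parameter, and noise model is an illustrative assumption, not a reconstruction of the cited papers.

For the two prefix-perturbation papers, a minimal sketch of the shared idea as the summaries describe it: during training, prefix-token embeddings are nudged toward semantic neighbors and lightly noised before the forward pass, so the model sees a neighborhood of each prefix rather than a single point.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_prefix(prefix_emb: np.ndarray, neighbor_emb: np.ndarray,
                   alpha: float = 0.1, sigma: float = 0.02) -> np.ndarray:
    """Illustrative prefix perturbation: interpolate each prefix-token embedding
    toward a semantic neighbor and add small Gaussian noise.

    prefix_emb:   (T, d) embeddings of the prefix tokens.
    neighbor_emb: (T, d) embeddings of semantically similar tokens (assumed given).
    alpha, sigma: interpolation weight and noise scale (hypothetical knobs).
    """
    noise = sigma * rng.standard_normal(prefix_emb.shape)
    return (1.0 - alpha) * prefix_emb + alpha * neighbor_emb + noise

# Toy usage: a 5-token prefix in an 8-dimensional embedding space.
prefix = rng.standard_normal((5, 8))
neighbors = prefix + 0.05 * rng.standard_normal((5, 8))  # stand-in "semantic neighbors"
perturbed = perturb_prefix(prefix, neighbors)
print(perturbed.shape)  # (5, 8); fed to the LM in place of the clean prefix during training
```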
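For DGPO, the summary pins down two ingredients: a bounded Hellinger distance and an entropy gate on advantage redistribution. The Hellinger formula below is standard; how the paper actually combines it with the gate is not stated here, so the weighting in `redistribute_advantage` is a guess for illustration only.

```python
import numpy as np

def hellinger(p: np.ndarray, q: np.ndarray) -> float:
    """Hellinger distance between two categorical distributions; bounded in [0, 1]."""
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def entropy(p: np.ndarray) -> float:
    p = np.clip(p, 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

def redistribute_advantage(seq_advantage: float,
                           old_probs: np.ndarray,   # (T, V) per-token dists, old policy
                           new_probs: np.ndarray,   # (T, V) per-token dists, new policy
                           entropy_threshold: float = 1.0) -> np.ndarray:
    """Illustrative token-level credit assignment: weight tokens by how far the
    policy's distribution moved (Hellinger), but only at high-entropy tokens."""
    weights = np.array([
        hellinger(o, n) if entropy(o) > entropy_threshold else 0.0
        for o, n in zip(old_probs, new_probs)
    ])
    if weights.sum() == 0.0:
        return np.full(len(old_probs), seq_advantage / len(old_probs))
    return seq_advantage * weights / weights.sum()

# Toy usage over a 4-token sequence with a 6-symbol vocabulary.
rng = np.random.default_rng(1)
old, new = rng.dirichlet(np.ones(6), size=4), rng.dirichlet(np.ones(6), size=4)
print(redistribute_advantage(1.0, old, new))  # per-token shares of the sequence advantage
```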
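For the kernelized advantage paper, a minimal Nadaraya-Watson sketch: smooth rewards across the few responses sampled for a prompt using a Gaussian kernel over some scalar response feature, and subtract the smoothed value as a baseline. The choice of feature (response length) and the bandwidth are assumptions for the example.

```python
import numpy as np

def kernel_smoothed_baseline(features: np.ndarray, rewards: np.ndarray,
                             bandwidth: float = 1.0) -> np.ndarray:
    """Nadaraya-Watson estimate of expected reward at each response's feature value,
    computed from the same small per-prompt group."""
    diffs = (features[:, None] - features[None, :]) / bandwidth
    weights = np.exp(-0.5 * diffs ** 2)          # Gaussian kernel
    return weights @ rewards / weights.sum(axis=1)

# Toy usage: 4 responses for one prompt, binary rewards, response length as the feature.
lengths = np.array([12.0, 15.0, 40.0, 44.0])
rewards = np.array([0.0, 1.0, 1.0, 0.0])
advantages = rewards - kernel_smoothed_baseline(lengths, rewards, bandwidth=10.0)
print(advantages)  # smoothed, lower-variance advantages under a tight sampling budget
```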
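For RTT, the one mechanism the summary makes explicit is intra-sample group normalization of token-level rewards; the relevance scores below are made up, and only the normalization step is the point.

```python
import numpy as np

def intra_sample_normalize(token_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize token-level rewards within one response (zero mean, unit std), so
    credit is assigned relative to the other tokens of the same sample."""
    return (token_rewards - token_rewards.mean()) / (token_rewards.std() + eps)

# Toy usage: rubric-relevance scores for 6 tokens of a single response.
relevance = np.array([0.1, 0.9, 0.8, 0.0, 0.2, 0.7])
print(intra_sample_normalize(relevance))
```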
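For reference alongside the RLHF survey entry, the Bradley-Terry-Luce model it names is the standard pairwise-preference model behind most RLHF reward fitting:

```latex
\[
  \Pr\bigl(y_w \succ y_l \mid x\bigr)
  = \frac{\exp\bigl(r(x, y_w)\bigr)}{\exp\bigl(r(x, y_w)\bigr) + \exp\bigl(r(x, y_l)\bigr)}
  = \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr),
\]
% where r is the reward model, \sigma the logistic function, and the model is fit by
% maximizing the log-likelihood of observed human preference pairs (y_w preferred to y_l).
```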