hub
Understanding R1-Zero-Like Training: A Critical Perspective
80 Pith papers cite this work.
abstract
DeepSeek-R1-Zero has shown that reinforcement learning (RL) at scale can directly enhance the reasoning capabilities of LLMs without supervised fine-tuning. In this work, we critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. Our analysis reveals that DeepSeek-V3-Base already exhibits an "Aha moment", while Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates, suggesting potential pretraining biases. Additionally, we identify an optimization bias in Group Relative Policy Optimization (GRPO), which artificially increases response length (especially for incorrect outputs) during training. To address this, we introduce Dr. GRPO, an unbiased optimization method that improves token efficiency while maintaining reasoning performance. Leveraging these insights, we present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a new state-of-the-art. Our code is available at https://github.com/sail-sg/understand-r1-zero.
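To make the GRPO bias and the Dr. GRPO fix concrete, here is a minimal numpy sketch assuming the standard GRPO formulation (group-mean-and-std-normalized advantages, per-response length-normalized token losses); Dr. GRPO is sketched as dropping the reward-std and response-length normalizations. The clipped policy-ratio term is omitted and MAX_TOKENS is a hypothetical constant generation budget, so this illustrates the normalization choices rather than the authors' implementation.

```python
import numpy as np

def grpo_advantages(rewards):
    # Standard GRPO: center the group rewards and divide by their std,
    # which re-weights questions by how spread out their rewards are.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-6)

def dr_grpo_advantages(rewards):
    # Dr. GRPO sketch: keep the centering, drop the std division.
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

def grpo_loss(per_token_losses):
    # Standard GRPO: mean over tokens within each response (divide by its
    # length), then mean over the group; long incorrect answers receive a
    # smaller per-token penalty.
    return float(np.mean([np.mean(tl) for tl in per_token_losses]))

MAX_TOKENS = 4096  # hypothetical constant generation budget

def dr_grpo_loss(per_token_losses):
    # Dr. GRPO sketch: sum over tokens and divide by a constant budget,
    # so the penalty no longer shrinks as the response grows longer.
    return float(np.mean([np.sum(tl) / MAX_TOKENS for tl in per_token_losses]))
```

Dividing a negative-advantage loss by a longer response length shrinks the per-token penalty, which is one way to read the length inflation on incorrect outputs noted in the abstract; a constant normalizer removes that incentive.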
hub tools
citation-role summary
citation-polarity summary
claims ledger
co-cited works
roles
background 1
polarities
background 1
representative citing papers
citing papers explorer
-
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
-
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.
-
DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning
DeepRefine refines agent-compiled knowledge bases via multi-turn abductive diagnosis and RL training with a GBD reward, yielding consistent downstream task gains.
-
The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits
The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interventions that enhance performance.
-
DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules
DiagnosticIQ benchmark shows frontier LLMs perform similarly on standard rule-to-action tasks but lose substantial accuracy under distractor expansion and condition inversion, pointing to calibration as the key deployment issue.
-
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on mathematical reasoning tasks.
-
Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR
HORA adaptively allocates rollouts using hit utility to improve Pass@K over compute-matched GRPO on math reasoning benchmarks while preserving Pass@1.
-
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.
-
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.
-
PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token count by 55% on TIMIT.
-
AffectGPT-RL: Revealing Roles of Reinforcement Learning in Open-Vocabulary Emotion Recognition
AffectGPT-RL applies reinforcement learning to optimize non-differentiable emotion wheel metrics in open-vocabulary multimodal emotion recognition, yielding performance gains and state-of-the-art results on basic emotion recognition benchmarks.
-
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
-
PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent
PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.
-
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better token efficiency.
-
Group-in-Group Policy Optimization for LLM Agent Training
GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while keeping the same rollout and memory footprint.
-
Tracing Uncertainty in Language Model "Reasoning"
Uncertainty trace profiles from LM reasoning traces predict correct final answers with AUROC up to 0.807 and enable early error detection using only initial tokens.
-
Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction
EqLen is a sample-construction framework that builds equal-length paired segments via dual-track generation and masking for stable group-relative RL in sequences, reframing the length problem as a comparison-unit issue rather than loss scaling.
-
Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning
Freshness-Aware PER augments prioritized experience replay with exponential age decay based on effective sample size to enable successful reuse of trajectories in LLM and VLM reinforcement learning, outperforming on-policy baselines on agentic tasks.
-
Debate as Reward: A Multi-Agent Reward System for Scientific Ideation via RL Post-Training
A multi-agent binary reward system with unbiased GRPO post-training on ICLR-320 data outperforms baselines on expert-rated novelty, feasibility, and effectiveness for scientific idea generation.
-
Foresight Optimization for Strategic Reasoning in Large Language Models
FoPO trains LLMs for strategic reasoning by combining self-interest with opponent modeling in policy optimization, yielding gains on two new datasets and better out-of-domain generalization than standard baselines.
-
DeonticBench: A Benchmark for Reasoning over Rules
DEONTICBENCH is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.
-
Teacher-Guided Policy Optimization for LLM Distillation
TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.
-
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduced to preserve diversity and improve quality.
-
H\"older Policy Optimisation
HölderPO unifies token aggregation in GRPO via the Hölder mean with dynamic p annealing, reporting 54.9% average math-benchmark accuracy and 93.8% ALFWorld success; a short sketch of the Hölder mean itself follows this list.
-
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
On-policy distillation gains efficiency from early foresight in module allocation and low-rank update directions, enabling EffOPD to accelerate training by 3x via adaptive extrapolation without extra modules or tuning.
-
Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning
Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.
-
UNIPO: Unified Interactive Visual Explanation for RL Fine-Tuning Policy Optimization
UNIPO is the first unified interactive visualization tool exposing token-level training dynamics of RL fine-tuning algorithms for LLMs through high-level overviews, step inspectors, and side-by-side comparisons.
-
FG-ExPO: Frontier-Guided Exploration-Prioritized Policy Optimization via Adaptive KL and Gaussian Curriculum
FG-ExPO improves GRPO by adaptively scaling the KL penalty with batch accuracy and sampling questions via a Gaussian centered at 0.5 accuracy, delivering up to 13.34 point gains on AIME 2025 pass@32.
-
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.
-
G-Zero: Self-Play for Open-Ended Generation from Zero Data
G-Zero uses the Hint-δ intrinsic reward to drive co-evolution between a Proposer and Generator via GRPO and DPO, providing a theoretical suboptimality guarantee for self-improvement from internal dynamics alone.
-
Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs
OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.
-
Structured Recurrent Mixers for Massively Parallelized Sequence Generation
Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.
-
AIPO: Learning to Reason from Active Interaction
AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, then drops the agents at inference.
-
Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
LPO reframes group-based RLVR as explicit target-projection on the LLM response simplex and performs exact divergence minimization to achieve monotonic listwise improvement with bounded gradients.
-
ZAYA1-8B Technical Report
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
-
Binary Rewards and Reinforcement Learning: Fundamental Challenges
Binary rewards make the set of reward-maximizing policies infinite in policy gradients; KL control selects the filtered base model but misspecification drives collapse to concentrated valid outputs instead.
-
Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression
A plug-and-play RL method adds batch-level distributional supervision via CCC rewards to reduce regression-to-the-mean in MLLMs on imbalanced regression benchmarks.
-
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
-
Co-Evolving Policy Distillation
CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific experts on text-image-video reasoning.
-
From Local Indices to Global Identifiers: Generative Reranking for Recommender Systems via Global Action Space
GloRank reformulates list-wise reranking as token generation over a global item identifier space, using supervised pre-training followed by reinforcement learning to maximize list-wise utility and outperforming baselines on benchmarks and industrial data.
-
Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance
Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math and code tasks.
-
TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only
TRN-R1-Zero is an RL-only post-training method that lets LLMs perform zero-shot node, edge, and graph reasoning on text-rich networks without supervised data or larger-model distillation.
-
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
-
Calibration-Aware Policy Optimization for Reasoning LLMs
CAPO improves LLM calibration by up to 15% while matching or exceeding GRPO accuracy through logistic AUC loss and noise masking, enabling better abstention and scaling performance.
-
Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO
Balanced Aggregation fixes sign-length coupling and length downweighting in GRPO by computing separate token means for positive and negative subsets and combining them with sequence-count weights, yielding more stable training and higher benchmark scores.
-
SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks
SPPO enables stable, sample-efficient alignment of LLMs on long-horizon reasoning tasks by using a decoupled scalar value function for low-variance advantages without multi-sampling.
-
Asymmetric Advantage Modulation Calibrates Entropy Dynamics in RLVR
AsymGRPO decouples positive and negative advantage modulation in RLVR to separately boost useful entropy and suppress noisy entropy, improving LLM reasoning performance.
-
Can LLMs Learn to Reason Robustly under Noisy Supervision?
Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning benchmarks even at high noise levels.
-
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training in three weeks on 512 GPUs.
-
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
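A side note on the HölderPO entry above: the Hölder (power) mean it builds on is standard, M_p(x) = ((1/n) Σ_i x_i^p)^(1/p), recovering the arithmetic mean at p = 1 (vanilla per-token averaging), the geometric mean as p approaches 0, and the minimum or maximum at large negative or positive p. The sketch below shows only this mean; how HölderPO applies it to GRPO token aggregation and anneals p is only as stated in that one-line summary.

```python
import numpy as np

def holder_mean(values, p, eps=1e-12):
    """Hölder (power) mean of positive values.

    p = 1 gives the arithmetic mean, p -> 0 the geometric mean,
    large negative p approaches the minimum, large positive p the maximum.
    """
    x = np.asarray(values, dtype=float)
    if abs(p) < 1e-8:  # geometric-mean limit at p = 0
        return float(np.exp(np.mean(np.log(x + eps))))
    return float(np.mean(x ** p) ** (1.0 / p))

# p = 1 recovers the plain token mean; sweeping p changes how much
# low-value tokens dominate the aggregate.
tokens = [0.2, 0.5, 0.9, 1.3]
print(holder_mean(tokens, 1.0))   # arithmetic mean = 0.725
print(holder_mean(tokens, -8.0))  # close to min(tokens) = 0.2
```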