Sparse but critical: A token-level analysis of distributional shifts in rlvr fine-tuning of llms

Haoming Meng, Kexin Huang, Shaohang Wei, Chiyu Ma, Shuo Yang, Xue Wang, Guoyin Wang, Bolin Ding, Jingren Zhou · 2026 · arXiv 2603.22446

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

read on arXiv browse 7 citing papers

citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

Not only where, But when: Temporal Scheduling for RLVR

cs.LG · 2026-05-25 · unverdicted · novelty 7.0

Temporal scheduling of credit allocation criteria over RLVR training, using trajectory percentiles to target heterogeneous behaviors, yields more stable policy entropy and better reasoning benchmark results than static allocation.

Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

RLRT augments GRPO by reinforcing tokens on correct student rollouts that the teacher would not have predicted, outperforming standard self-distillation and exploration baselines on Qwen3 models.

Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

Attention entropy splits RL training tokens into stable anchors and volatile explorers, and entropy-aware reweighting improves held-out reasoning performance.

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

DelTA estimates token coefficients to amplify discriminative directions in token-gradient vectors, reweighting the RLVR surrogate to produce more contrastive side-wise centroids and yielding 3.26 and 2.62 point gains on math benchmarks for 8B and 14B Qwen3 models.

ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

cs.CV · 2026-05-19 · unverdicted · novelty 6.0 · 2 refs

ParaVT introduces the first multi-agent RL framework for parallel video tool calling in LMMs, using PARA-GRPO to resolve the Tool Prior Paradox and achieve +7.9% average improvement over Qwen3-VL baseline across six benchmarks.

Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals

cs.LG · 2026-05-21 · unverdicted · novelty 5.0

Proposes Near-boundary Stochastic Rescue (NSR) as a stochastic modification to clipping in RLVR that recovers near-boundary signals and yields gains over baselines like DAPO and GSPO.

One-Way Policy Optimization for Self-Evolving LLMs

cs.LG · 2026-05-21 · unverdicted · novelty 5.0

OWPO decouples optimization direction from magnitude via asymmetric reweighting (Accelerated Alignment for inferior deviations, Gain Locking for superior) plus iterative references to create a ratchet effect for continuous LLM improvement.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR cs.LG · 2026-05-11 · unverdicted · none · ref 15
RLRT augments GRPO by reinforcing tokens on correct student rollouts that the teacher would not have predicted, outperforming standard self-distillation and exploration baselines on Qwen3 models.

Sparse but critical: A token-level analysis of distributional shifts in rlvr fine-tuning of llms

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer