Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: method (1)

citation-polarity summary: use method (1)

years: 2026 (9)

verdicts: unverdicted (9)

representative citing papers

Tandem Reinforcement Learning with Verifiable Rewards

cs.AI · 2026-06-26 · unverdicted · novelty 7.0

TRL extends tandem training to RLVR pipelines, matching GRPO's solo reasoning performance on Qwen3-4B math tasks while improving handoff robustness, reducing distributional drift, and increasing CoT legibility for the junior model.

Not only where, But when: Temporal Scheduling for RLVR

cs.LG · 2026-05-25 · unverdicted · novelty 7.0

Temporal scheduling of credit allocation criteria over RLVR training, using trajectory percentiles to target heterogeneous behaviors, yields more stable policy entropy and better reasoning benchmark results than static allocation.

One-Way Policy Optimization for Self-Evolving LLMs

cs.LG · 2026-05-21 · unverdicted · novelty 5.0

OWPO decouples optimization direction from magnitude via asymmetric reweighting (Accelerated Alignment for inferior deviations, Gain Locking for superior ones) and uses iterative references to create a ratchet effect for continuous LLM improvement.
