OPD updates occupy a relaxed off-principal regime and rapidly lock into a low-dimensional subspace that is functionally sufficient for its performance, distinct from SFT and RLVR trajectories.
hub Canonical reference
The invisible leash: Why rlvr may or may not escape its origin
Canonical reference. 80% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
OrderGrad supplies unbiased likelihood-ratio and reparameterization gradient estimators for finite-sample L-statistics by applying a rank-based reward transformation usable in standard policy-gradient updates.
Extrapolative weight averaging of RL checkpoints trained under nested unit-test coverage extends a correctness-efficiency frontier and boosts ensemble pass rates in code generation across model scales and inference modes.
CurveRL derives a quantile-coordinate reweighting rule from a utility functional on pass rates and shows it outperforms GRPO on reasoning benchmarks.
GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.
HORA adaptively allocates rollouts using hit utility to improve Pass@K over compute-matched GRPO on math reasoning benchmarks while preserving Pass@1.
VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.
Introduces a hierarchical latent selection model showing SFT supplies raw module materials in compound traces while RL decomposes them to identify atomic modules and enable recombination for new reasoning configurations.
Proposes MaxPO using a Leave-Two-Out baseline for centered unbiased advantages in max@K policy gradients, with a unified derivation of finite-batch estimators.
REFT improves Pass@1/8/64 in RLVR by uniform first-token sampling from top-N candidates across 0.5B-7B models and multiple difficulty levels.
Coarser compressed CoT needs more SFT data, scales differently with repetition, and RL later breaks apart the compressed steps learned in SFT.
NudgeRL conditions RLVR rollouts on strategy-level contexts to drive diverse trajectories and applies an inter/intra-context reward decomposition plus distillation objective, outperforming GRPO and oracle baselines on math benchmarks.
MOPD improves on-policy distillation by using peer successes and failures from multiple rollouts to construct more informative teacher signals, yielding consistent gains over baselines on reasoning benchmarks.
LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.
Parallel inference rollouts aggregated into pseudo-references enable reference-free RL supervision that matches expert-annotated performance on health tasks while using 9x less test-time compute.
Closed-loop self-evolution on LLMs improves reasoning on Knights and Knaves tasks but plateaus short of oracle-supervised levels, with multi-turn revision nearly matching it for large models.
IMAX trains soft prefixes with an InfoMax reward to drive diverse exploration in RLVR, yielding up to 11.60% gains in Pass@4 over standard RLVR across model scales.
APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.
FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.
Introduces polychromic objectives adapted into PPO via vine sampling and modified advantages, showing higher success rates and better coverage under perturbations on BabyAI, Minigrid, and algorithmic tasks.
citing papers explorer
No citing papers match the current filters.