GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.
hub Canonical reference
The Invisible Leash: Why RLVR may or may not escape its origin
Canonical reference. 80% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
HORA adaptively allocates rollouts using hit utility to improve Pass@K over compute-matched GRPO on math reasoning benchmarks while preserving Pass@1.
VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.
NudgeRL conditions RLVR rollouts on strategy-level contexts to drive diverse trajectories and applies an inter/intra-context reward decomposition plus distillation objective, outperforming GRPO and oracle baselines on math benchmarks.
LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.
Parallel inference rollouts aggregated into pseudo-references enable reference-free RL supervision that matches expert-annotated performance on health tasks while using 9x less test-time compute.
IMAX trains soft prefixes with an InfoMax reward to drive diverse exploration in RLVR, yielding up to 11.60% gains in Pass@4 over standard RLVR across model scales.
APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.
FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.
Introduces polychromic objectives adapted into PPO via vine sampling and modified advantages, showing higher success rates and better coverage under perturbations on BabyAI, Minigrid, and algorithmic tasks.
citing papers explorer
-
Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning
GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.
-
Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR
HORA adaptively allocates rollouts using hit utility to improve Pass@K over compute-matched GRPO on math reasoning benchmarks while preserving Pass@1.
-
Reinforcement Learning via Value Gradient Flow
VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.
-
Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR
NudgeRL conditions RLVR rollouts on strategy-level contexts to drive diverse trajectories and applies an inter/intra-context reward decomposition plus distillation objective, outperforming GRPO and oracle baselines on math benchmarks.
-
Generalization in LLM Problem Solving: The Case of the Shortest Path
LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.
-
Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision
Parallel inference rollouts aggregated into pseudo-references enable reference-free RL supervision that matches expert-annotated performance on health tasks while using 9x less test-time compute.
-
How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors
IMAX trains soft prefixes with an InfoMax reward to drive diverse exploration in RLVR, yielding up to 11.60% gains in Pass@4 over standard RLVR across model scales.
-
Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning
APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.
-
Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
-
Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.
-
Polychromic Objectives for Reinforcement Learning
Introduces polychromic objectives adapted into PPO via vine sampling and modified advantages, showing higher success rates and better coverage under perturbations on BabyAI, Minigrid, and algorithmic tasks.
- Multi-Rollout On-Policy Distillation via Peer Successes and Failures
- Policy Improvement Reinforcement Learning