
JustRL: Scaling a 1.5B LLM with a Simple RL Recipe

Andre Wang He, Daniel Fried, Sean Welleck · 2025 · arXiv 2512.16649

Canonical reference. 80% of citing Pith papers cite this work as background.

15 Pith papers citing it

Background 80% of classified citations


citation-role summary

background: 4 · method: 1

citation-polarity summary

background: 4 · use method: 1

representative citing papers

Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

SC-GRPO improves RL with verifiable rewards by multiplying GRPO gradients with self-induced per-token KL divergence, outperforming GRPO by 8.1% and DAPO by 5.9% on math, code, and agent benchmarks.

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

cs.CL · 2026-06-16 · unverdicted · novelty 7.0

ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite with largest gains at the 0.8B scale.

OPRD: On-Policy Representation Distillation

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

OPRD performs distillation in hidden-state space on on-policy data for deterministic gradients and better math benchmark performance, plus OPRD-Bridge for cross-architecture transfer via low-rank projectors.

Fine-Tuning Small Reasoning Models for Quantum Field Theory

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.

Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

cs.LG · 2026-06-11 · unverdicted · novelty 6.0

On-policy distillation produces coordinate-sparse, FFN-heavy updates that are full-rank but spectrally concentrated away from principal singular subspaces and near-zero source weights.

Enhancing LLM Metacognition via Cognitive Pairwise Training

cs.LG · 2026-05-30 · unverdicted · novelty 6.0

CPT is introduced as a pairwise reasoning-trace comparison stage that improves the reasoning-metacognition trade-off over standard SFT+RL pipelines across model scales.

Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

Entropy polarity is a signed token-level quantity derived from a first-order approximation of entropy change that predicts whether RL updates expand or contract policy entropy in LLM fine-tuning, revealing an asymmetry between high- and low-probability tokens.

Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.

Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning

cs.CL · 2026-05-07 · unverdicted · novelty 6.0

A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks, including 96.7% and 99.2% on AIME 2025 at the 4B and 30B scales, by balancing tool-use trajectories and optimizing for pass@k during SFT before stable RLVR.

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

cs.LG · 2026-04-14 · unverdicted · novelty 6.0

On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.

On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training

cs.LG · 2026-01-12 · unverdicted · novelty 6.0

SFT and RL cannot be decoupled in LLM post-training because each step increases the loss or lowers the reward of the prior step under KL and PL analyses.

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

cs.LG · 2026-06-17 · unverdicted · novelty 5.0

STARE applies surprisal-guided token-level advantage reweighting plus a target-entropy gate to stabilize entropy in GRPO RL for LLMs, yielding stable training and 4-8% gains on AIME24/25 over baselines.

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

cs.CL · 2026-05-12 · unverdicted · novelty 5.0

On-policy distillation gains efficiency from early foresight in module allocation and update directions, which the proposed EffOPD method exploits for 3x faster training with comparable performance.

On-Policy Distillation with Best-of-N Teacher Rollout Selection

cs.CV · 2026-05-10 · unverdicted · novelty 5.0 · 2 refs

BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one with a priority rule that puts correctness first and alignment second, yielding gains on AIME and AMC math benchmarks.

STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

cs.CL · 2026-02-17
