pith. sign in

Canonical reference

Justrl: Scaling a 1.5 b llm with a simple rl recipe

Canonical reference. 80% of citing Pith papers cite this work as background.

13 Pith papers citing it
Background 80% of classified citations

citation-role summary

background 4 method 1

citation-polarity summary

years

2026 13

clear filters

representative citing papers

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

cs.CL · 2026-06-16 · unverdicted · novelty 7.0

ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite with largest gains at the 0.8B scale.

OPRD: On-Policy Representation Distillation

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

OPRD performs distillation in hidden-state space on on-policy data for deterministic gradients and better math benchmark performance, plus OPRD-Bridge for cross-architecture transfer via low-rank projectors.

Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

Entropy polarity is a signed token-level quantity derived from a first-order approximation of entropy change that predicts whether RL updates expand or contract policy entropy in LLM fine-tuning, revealing an asymmetry between high- and low-probability tokens.

On-Policy Distillation with Best-of-N Teacher Rollout Selection

cs.CV · 2026-05-10 · unverdicted · novelty 5.0 · 2 refs

BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.

citing papers explorer

Showing 1 of 1 citing paper after filters.

  • On-Policy Distillation with Best-of-N Teacher Rollout Selection cs.CV · 2026-05-10 · unverdicted · none · ref 16 · 2 links

    BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.