The 1.7B RL’d-teacher rows are a same-size control that isolates the dense-reward effect from teacher scale

Raw, SFT rows test C1 by using the same transfer protocol without teacher-side sparse RL · 2024

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

cs.LG · 2026-05-12 · unverdicted · novelty 5.0

Applying sparse rewards to capable teachers for exploration, then distilling densely into students, outperforms applying sparse rewards (such as GRPO) directly to the deployment model.

citing papers explorer


Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training cs.LG · 2026-05-12 · unverdicted · none · ref 29
