[2026b] propose RL-aware distillation through advantage-aware selective imitation during PPO/GRPO-style updates

formulate KD as entropy-regularized value optimization with on-policy, off-policy demonstrations, while Zhang et al · 2026

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

cs.LG · 2026-05-12 · unverdicted · novelty 5.0

Sparse RL on a strong teacher followed by dense distillation to the student outperforms direct GRPO on the student for math tasks, with a forward-KL + OPD bridge enabling further gains.

citing papers explorer

Showing 1 of 1 citing paper.

Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

cs.LG · 2026-05-12 · unverdicted · none · ref 27

Sparse RL on a strong teacher followed by dense distillation to the student outperforms direct GRPO on the student for math tasks, with a forward-KL + OPD bridge enabling further gains.

