pith. sign in

← back to paper

Review history

arxiv: 2605.12483 · 4 revisions

Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

  1. 2026-05-21 UNVERDICTED LOW v0.9.0 novelty 5.0
    57431 ms 8560 in 1406 out 2026-05-21T07:52:35.152181+00:00
  2. 2026-05-19 UNVERDICTED LOW v0.9.0 novelty 5.0
    32058 ms 5849 in 1135 out 2026-05-19T16:37:05.873569+00:00
  3. 2026-05-15 UNVERDICTED LOW v0.9.0 novelty 6.0
    56827 ms 5675 in 1235 out 2026-05-15T05:20:20.794990+00:00
  4. 2026-05-13 UNVERDICTED LOW v0.9.0 novelty 5.0
    38297 ms 5678 in 1205 out 2026-05-13T05:01:24.708241+00:00