TOPD improves on-policy distillation for LLM reasoning by using near-future guidance to identify divergent states, raising average accuracy from 47.8% to 52.2% on math benchmarks including AIME24 and AIME25.
arXiv preprint arXiv:2603.14602 , year=
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
background 1