TOPD improves on-policy distillation for LLM reasoning by using near-future guidance to identify divergent states, raising average accuracy from 47.8% to 52.2% on math benchmarks including AIME24 and AIME25.
IEEE Transactions on Emerging Topics in Computational Intelligence
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.CL (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers:
Rock Tokens in on-policy distillation persist at high loss, account for up to 18% of outputs, absorb large gradient norms, but add negligible value to reasoning performance.
citing papers explorer
- Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance
  TOPD improves on-policy distillation for LLM reasoning by using near-future guidance to identify divergent states, raising average accuracy from 47.8% to 52.2% on math benchmarks including AIME24 and AIME25.
- Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation
  Rock Tokens in on-policy distillation persist at high loss, account for up to 18% of outputs, absorb large gradient norms, but add negligible value to reasoning performance.