Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

Hangjie Yuan; Jing Jin; Leqi Zheng; Tao Feng; Wenrui Zhou; Xing Hu; Xuchang Zhong; Yongzi Yu; Yuying Li

arXiv: 2606.02684 · v2 · pith:XW454VBUnew · submitted 2026-06-01 · cs.LG · cs.AI · cs.CL


Yuying Li, Leqi Zheng, Yongzi Yu, Wenrui Zhou, Xuchang Zhong, Xing Hu, Jing Jin, Hangjie Yuan, Tao Feng


classification cs.LG, cs.AI, cs.CL

keywords fire-opd, optimization, supervision, then, trajectories, distillation, filter, granularity


On-policy distillation (OPD) in large language models is shifting from full-trace KL supervision toward more selective training paradigms. Recent OPD methods increasingly focus on selecting which trajectories to learn from, which tokens are most informative, and which supervision signals are most reliable. Motivated by this trend, we rethink the optimization granularity of OPD and propose FiRe-OPD (Filter, then Reweight), which jointly adjusts supervision signals at both the trajectory and token levels. Specifically, FiRe-OPD first filters trajectories to remove low-quality rollout samples, and then applies soft reweighting within the retained trajectories to emphasize informative tokens. Compared with hard token selection, FiRe-OPD uses a soft-weighting mechanism to mitigate information loss and improve optimization stability, thereby achieving finer-grained OPD optimization. We validate the effectiveness of FiRe-OPD across strong-to-weak, single-teacher, and multi-teacher settings, and demonstrate its superiority over recent token-level OPD methods (e.g., +6.25 on AIME 2024 in strong-to-weak, +18.81 on Miner in multi-teacher). Our code is available at https://github.com/YuYingLi0/FiRe-OPD.
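The two-stage idea in the abstract (drop low-quality trajectories, then softly reweight tokens in the ones that remain) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not state the filtering criterion or the weighting function, so the function name `filter_then_reweight_loss`, the per-trajectory `traj_scores`, the `keep_ratio` and `temperature` parameters, and the choice of a softmax over per-token divergence are all assumptions made for illustration.

```python
import numpy as np

def filter_then_reweight_loss(token_kl, traj_scores, mask,
                              keep_ratio=0.5, temperature=1.0):
    """Hypothetical filter-then-reweight objective (illustrative only).

    token_kl:    (B, T) per-token student/teacher divergence.
    traj_scores: (B,) trajectory quality score, higher is better (assumed).
    mask:        (B, T) 1 for real tokens, 0 for padding.
    Assumes every trajectory has at least one real token.
    """
    token_kl = np.asarray(token_kl, dtype=float)
    traj_scores = np.asarray(traj_scores, dtype=float)
    mask = np.asarray(mask, dtype=float)

    # Step 1 (filter): keep the top-scoring fraction of trajectories.
    n_keep = max(1, int(np.ceil(keep_ratio * len(traj_scores))))
    keep_idx = np.argsort(-traj_scores)[:n_keep]
    kl, m = token_kl[keep_idx], mask[keep_idx]

    # Step 2 (reweight): softmax over valid tokens, so no token is
    # discarded outright; padding gets weight exactly zero.
    logits = np.where(m > 0, kl / temperature, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Rescale so the mean weight over valid tokens is 1 per trajectory.
    # In an autodiff framework these weights would likely be treated as
    # constants (an assumption, not something the abstract states).
    weights = weights * m.sum(axis=1, keepdims=True)

    loss = (weights * kl).sum() / m.sum()
    return loss, keep_idx, weights

token_kl = [[0.1, 0.2, 0.3], [1.0, 1.0, 1.0], [0.5, 0.0, 0.0]]
traj_scores = [0.9, 0.1, 0.5]
mask = [[1, 1, 1], [1, 1, 1], [1, 0, 0]]
loss, keep_idx, weights = filter_then_reweight_loss(token_kl, traj_scores, mask)
```

In this sketch a large `temperature` flattens the weights toward 1, which recovers a plain masked mean over the retained trajectories, while a small `temperature` approaches hard token selection. That contrast is one way to read the abstract's comparison between soft weighting and hard selection.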


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Blockwise Policy-Drift Gating for On-Policy Distillation
cs.LG 2026-06 unverdicted novelty 5.0

Blockwise policy-drift gating raises mean pass@8 from 0.4978 to 0.5160 on four math benchmarks by reweighting OPD losses with detached mean-normalized gates from student policy drift over 64-token blocks.

A Formula-Driven Survey and Research Agenda for On-Policy Distillation
cs.AI 2026-06 unverdicted novelty 4.0

A survey creates a taxonomy for on-policy distillation in LLMs that separates temporal credit assignment from vocabulary-level probability routing.