ROSD: Reflective On-Policy Self-Distillation for Language Model Reasoning across Domains

Daiting Shi; Jingzhou He; Liu Yang; Xiao-Ming Wu; Xin Xin; Xinyu Ma; Yujie Feng; Zhaochun Ren; Ziqi Zhao

arxiv: 2605.28014 · v1 · pith:QDRCQ3OSnew · submitted 2026-05-27 · 💻 cs.CL · cs.LG

Ziqi Zhao, Xinyu Ma, Liu Yang, Yujie Feng, Daiting Shi, Jingzhou He, Xin Xin, Zhaochun Ren, Xiao-Ming Wu

Pith reviewed 2026-06-29 13:01 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords on-policy self-distillation · language model reasoning · out-of-domain generalization · reflective distillation · error localization · self-reflector · LLM reasoning

The pith

ROSD improves LLM reasoning by using a self-reflector to extract a corrective idea and restrict distillation to the first error span in each rollout.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard on-policy self-distillation conditions the teacher on verified solutions and applies updates across the entire response, which encourages imitation of reference trajectories and can overwrite valid reasoning steps. ROSD replaces this with a self-reflector that identifies a corrective idea and the initial erroneous span, then guides the teacher only on that span. The result is error-specific correction that keeps good prefixes intact. Experiments across in-domain and out-of-domain reasoning benchmarks show stronger overall in-domain results and markedly better generalization than prior OPSD approaches.

Core claim

ROSD turns reference-solution imitation into targeted reasoning correction through reflection-guided, error-localized distillation. For each rollout, a self-reflector extracts a corrective idea and locates the first erroneous span. The corrective idea guides the self-teacher toward targeted supervision, while the localized error span restricts distillation to where correction is needed. This design corrects flawed reasoning while preserving valid prefixes.
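
The restriction of distillation to a localized span can be pictured as a mask over per-token divergences. The following is a minimal sketch, not the authors' implementation: the function names, the half-open `[span_start, span_end)` convention, and the use of plain KL divergence (chosen here for brevity) are all assumptions made for illustration.

```python
import math

def token_kl(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) at one token position over a shared vocabulary."""
    return sum(
        t * (math.log(t + eps) - math.log(s + eps))
        for t, s in zip(teacher_probs, student_probs)
        if t > 0.0
    )

def localized_distill_loss(teacher_dists, student_dists, span_start, span_end):
    """Mean per-token divergence over [span_start, span_end) only.

    Tokens outside the span (for example the valid prefix) contribute
    nothing, so they would receive no distillation signal in this sketch.
    """
    n = len(student_dists)
    if len(teacher_dists) != n:
        raise ValueError("teacher and student must cover the same tokens")
    start, end = max(0, span_start), min(n, span_end)
    if start >= end:
        return 0.0  # no localized error: nothing to distill
    losses = [token_kl(teacher_dists[i], student_dists[i]) for i in range(start, end)]
    return sum(losses) / len(losses)

# Hypothetical 3-token rollout over a 2-word vocabulary.
student = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
teacher = [[0.1, 0.9], [0.5, 0.5], [0.8, 0.2]]
full_loss = localized_distill_loss(teacher, student, 0, 3)   # full-response update
local_loss = localized_distill_loss(teacher, student, 2, 3)  # error span only
```

In the toy data the teacher also disagrees with the student at position 0, but because that position lies before the span it does not enter `local_loss`. This is the sense in which a valid prefix is left untouched.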

What carries the argument

The self-reflector that extracts a corrective idea and locates the first erroneous span, enabling error-localized distillation instead of full-response updates.
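
To make the reflector's two outputs concrete, here is a hedged sketch of one way they could be represented and mapped onto rollout tokens. The `Reflection` class, its `error_quote` field, and the idea that the reflector quotes the erroneous text verbatim are hypothetical; the source does not specify the output format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Reflection:
    corrective_idea: str  # short hint that would be passed to the self-teacher
    error_quote: str      # hypothetical: text quoted as the first wrong step

def locate_error_span(tokens: List[str], reflection: Reflection) -> Optional[Tuple[int, int]]:
    """Return token indices [start, end) covering the first occurrence of the quote."""
    text = "".join(tokens)
    char_start = text.find(reflection.error_quote)
    if char_start < 0 or not reflection.error_quote:
        return None  # quote missing or empty: no span can be localized
    char_end = char_start + len(reflection.error_quote)
    start = end = None
    pos = 0
    for i, tok in enumerate(tokens):
        tok_start, tok_end = pos, pos + len(tok)
        if start is None and tok_end > char_start:
            start = i      # first token that reaches into the quoted text
        if tok_start < char_end:
            end = i + 1    # last token that begins before the quote ends
        pos = tok_end
    return (start, end)

tokens = ["2", " +", " 2", " =", " 5", ".", " So", " x", " =", " 5"]
reflection = Reflection(corrective_idea="Recompute the sum 2 + 2.", error_quote="= 5")
span = locate_error_span(tokens, reflection)
```

Because `str.find` returns the earliest match, the later repetition of `= 5` in the toy rollout is ignored, which matches the "first erroneous span" wording.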

If this is right

Stronger in-domain reasoning performance overall than standard OPSD.
Substantially better out-of-domain generalization than standard OPSD.
Preservation of valid reasoning prefixes during the distillation step.
Shift from imitation of reference trajectories to error-specific correction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reflection-plus-localization pattern could be tested on non-reasoning sequence tasks such as code generation or dialogue.
If the reflector works reliably, the method reduces dependence on large sets of verified reference solutions.
Error localization may lower the risk of reinforcing domain-specific patterns that hurt generalization.

Load-bearing premise

A self-reflector can reliably extract a corrective idea and accurately locate the first erroneous span without introducing new errors or misidentifying valid steps.

What would settle it

An ablation that supplies the self-reflector with deliberately incorrect error locations or corrective ideas, after which out-of-domain gains over standard OPSD disappear.
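
One way such an ablation could corrupt error locations is to move each located span to a random position of the same length. This is an illustrative sketch under that assumption; `corrupt_span` is a hypothetical helper and the paper does not describe this procedure.

```python
import random
from typing import Tuple

def corrupt_span(span: Tuple[int, int], seq_len: int, rng: random.Random) -> Tuple[int, int]:
    """Replace a located span with a random same-length span elsewhere in the rollout."""
    start, end = span
    length = end - start
    if length <= 0 or length > seq_len:
        raise ValueError("span must be non-empty and fit in the sequence")
    # Every legal start position except the original one.
    candidates = [s for s in range(seq_len - length + 1) if s != start]
    if not candidates:
        return span  # the span fills the whole sequence; nothing to move
    new_start = rng.choice(candidates)
    return (new_start, new_start + length)

rng = random.Random(0)  # fixed seed so the ablation is repeatable
moved = corrupt_span((3, 5), 10, rng)
```

Keeping the span length fixed means the corrupted run distills over the same number of tokens as the original, so any change in results is attributable to where the supervision lands rather than how much of it there is.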

Figures

Figures reproduced from arXiv: 2605.28014 by Daiting Shi, Jingzhou He, Liu Yang, Xiao-Ming Wu, Xin Xin, Xinyu Ma, Yujie Feng, Zhaochun Ren, Ziqi Zhao.

**Figure 1.** Left: Pilot study. We use SDPO (Hübotter et al., 2026) as a representative OPSD method. All methods are trained on the Material dataset with Qwen3-8B, and evaluated on Material and ToolUse as the in-domain and out-of-domain benchmarks, respectively. Right: Comparison of post-training methods. Green and red tokens denote the valid prefix before the first error and the error-affected suffix, respectively. (a…

**Figure 2.** Self-reflection prompt templates used by ROSD. The system prompt provides task-level instructions to …

**Figure 3.** In-domain training dynamics on Material and Chemistry. We report the mean@16 test score, and the …

**Figure 4.** Rolling average of rollout accuracy over 10 …

**Figure 5.** Error localization dynamics during training.

**Figure 6.** Training efficiency. We report time (seconds) …

**Figure 7.** Response length dynamics. A rolling average …

**Figure 8.** System prompt for ScienceQA. [User] {question} Please reason step by step …

**Figure 10.** User prompt for ToolUse. [User] {question} Please reason step by step, and put your final answer within \boxed{}

**Figure 12.** Teacher input template with reflection. The self-teacher preserves the original system and user prompts …
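
The captions for Figures 3 and 4 mention a mean@16 score and a rolling average. As a hedged illustration, the sketch below assumes mean@k means the average score over k sampled responses per question, then averaged over questions, and that the rolling average is a trailing window. Neither definition is confirmed by the truncated captions.

```python
from typing import List, Sequence

def mean_at_k(scores_per_question: Sequence[Sequence[float]]) -> float:
    """Average score over the k samples of each question, then over questions."""
    per_question = [sum(s) / len(s) for s in scores_per_question]
    return sum(per_question) / len(per_question)

def rolling_average(values: Sequence[float], window: int) -> List[float]:
    """Trailing mean over up to `window` most recent values."""
    out: List[float] = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        out.append(running / min(i + 1, window))
    return out

# Two questions with k = 2 samples each (k would be 16 under the assumed reading).
score = mean_at_k([[1, 0], [1, 1]])
smoothed = rolling_average([1, 2, 3, 4], window=2)
```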

Original abstract

On-policy self-distillation (OPSD) improves the reasoning performance of large language models (LLMs) by providing dense token-level supervision for on-policy rollouts. However, existing OPSD methods often yield limited gains on in-domain reasoning and generalize poorly to out-of-domain problems. We identify two key causes: conditioning the self-teacher on a verified solution encourages imitation of training-domain reference trajectories rather than error-specific correction, and applying distillation to the full response can overwrite valid reasoning prefixes and reinforce overfitting. We propose Reflective On-policy Self-Distillation (ROSD), a framework that turns reference-solution imitation into targeted reasoning correction through reflection-guided, error-localized distillation. For each rollout, ROSD uses a self-reflector to extract a corrective idea and locate the first erroneous span. The corrective idea guides the self-teacher toward targeted supervision, while the localized error span restricts distillation to where correction is needed. This design corrects flawed reasoning while preserving valid prefixes. Experiments on multiple in-domain and out-of-domain reasoning benchmarks show that ROSD yields stronger in-domain reasoning performance overall and substantially better out-of-domain generalization than standard OPSD. Code is available at https://github.com/ZiqiZhao1/ROSD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ROSD adds reflection-guided error localization to on-policy distillation, but the reflector's reliability is unverified and the claimed gains rest on that untested step.

The main takeaway is that ROSD tries to improve on-policy self-distillation by using a self-reflector to pull a corrective idea from the reference solution and then restrict the distillation loss to the first erroneous span in the model's rollout. This directly targets the two problems they name: full-response imitation of training-domain trajectories and overwriting valid prefixes.

The new piece is the combination of reflection for idea extraction plus span localization to make the supervision targeted rather than blanket. That is a distinct move past the OPSD methods cited in the abstract, and the logic for preserving good prefixes while correcting errors is straightforward.

The experiments are described as showing better in-domain results and notably stronger out-of-domain generalization, which would matter if the numbers check out.

The soft spot is the one the stress-test note flags. The whole advantage depends on the reflector actually extracting a usable corrective idea and correctly identifying the first bad span at a high enough rate. If it fails often, the method either collapses back to standard imitation or starts corrupting correct prefixes. The abstract lays out the mechanism but supplies no quantitative checks: no agreement rates, no ablation on reflector quality, no error-localization precision numbers. That leaves the central claim without direct support.

The paper is aimed at people working on LLM reasoning and distillation methods. A reader already following OPSD work would get value from the concrete design choice, even if they have to run their own checks on the reflector.

I would send it to peer review so the experimental details and any reflector validation can be examined.

Referee Report

2 major / 2 minor

Summary. The paper claims that standard on-policy self-distillation (OPSD) yields limited in-domain gains and poor out-of-domain generalization because it encourages imitation of reference trajectories and applies distillation over full responses. ROSD addresses this by introducing a self-reflector that extracts a corrective idea from the reference and identifies the first erroneous span in each rollout; distillation is then restricted to that span under guidance from the corrective idea. Experiments on multiple in-domain and out-of-domain reasoning benchmarks are reported to show stronger overall in-domain performance and substantially better out-of-domain generalization than standard OPSD. Code is released.

Significance. If the central mechanism proves reliable, the targeted correction approach could meaningfully advance on-policy distillation methods for LLM reasoning by reducing overfitting to training-domain trajectories while preserving valid prefixes. The public code release is a clear strength that supports reproducibility.

major comments (2)

[Abstract, §3] Abstract and §3 (method description): the central claim that restricting distillation to the first erroneous span 'preserves valid prefixes' and supplies 'targeted correction' is load-bearing, yet the manuscript supplies no quantitative evidence (human agreement rates, precision/recall on span localization, or ablation removing the reflector) that the self-reflector meets the required reliability threshold at non-negligible rates.
[§4] §4 (experiments): the reported gains in out-of-domain generalization are attributed to error-localized distillation, but without an ablation that compares ROSD against a version using the same corrective idea yet full-response distillation, it is unclear whether the localization step, rather than the corrective idea alone, drives the improvement.
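
The span-localization precision and recall requested in the first major comment could be computed per rollout along the following lines. This is a sketch only: it assumes half-open token spans and a human-annotated reference span, and `span_overlap_metrics` is a hypothetical helper, not part of the released code.

```python
from typing import Dict, Tuple

def span_overlap_metrics(pred: Tuple[int, int], gold: Tuple[int, int]) -> Dict[str, float]:
    """Token-level precision, recall, and IoU between two half-open spans."""
    inter = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    pred_len = pred[1] - pred[0]
    gold_len = gold[1] - gold[0]
    union = pred_len + gold_len - inter
    return {
        "precision": inter / pred_len if pred_len else 0.0,
        "recall": inter / gold_len if gold_len else 0.0,
        "iou": inter / union if union else 0.0,
    }

# Reflector says tokens 3..6 are wrong; the annotator marked tokens 5..8.
metrics = span_overlap_metrics(pred=(3, 7), gold=(5, 9))
```

Averaging these values over a sampled subset of rollouts would give the kind of reliability number the referee asks for. Low precision would indicate that valid steps are being overwritten.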

minor comments (2)

[§3] Notation for the self-reflector output (corrective idea and span) should be formalized with explicit symbols to avoid ambiguity when describing the modified distillation loss.
[Abstract, §4] The abstract states 'multiple in-domain and out-of-domain reasoning benchmarks' but does not name them; the experimental section should list the exact datasets and splits used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We respond point by point to the major comments and indicate the revisions we will make.

Referee: [Abstract, §3] Abstract and §3 (method description): the central claim that restricting distillation to the first erroneous span 'preserves valid prefixes' and supplies 'targeted correction' is load-bearing, yet the manuscript supplies no quantitative evidence (human agreement rates, precision/recall on span localization, or ablation removing the reflector) that the self-reflector meets the required reliability threshold at non-negligible rates.

Authors: We agree that the manuscript does not provide direct quantitative validation of the self-reflector's span localization (e.g., human agreement rates, precision/recall) or an ablation that removes the reflector. End-to-end gains serve as indirect evidence in the current version. In revision we will add both an ablation removing the reflector and human evaluation of localization accuracy on a sampled subset of rollouts. revision: yes
Referee: [§4] §4 (experiments): the reported gains in out-of-domain generalization are attributed to error-localized distillation, but without an ablation that compares ROSD against a version using the same corrective idea yet full-response distillation, it is unclear whether the localization step, rather than the corrective idea alone, drives the improvement.

Authors: We concur that the experiments do not isolate the localization step from the corrective idea. We will add the requested ablation (corrective idea with full-response distillation) in the revised manuscript to clarify the contribution of error localization. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method with independent experimental validation

The paper introduces ROSD as a modification to OPSD that incorporates a self-reflector for corrective idea extraction and error-span localization, then reports benchmark results showing improved in-domain and out-of-domain performance. No derivation step reduces by construction to its inputs: there are no self-definitional equations, no parameters fitted on a subset and then relabeled as predictions, and no load-bearing self-citations or uniqueness theorems. The mechanism is described procedurally and evaluated externally via experiments, making the central claims falsifiable and non-circular. The reader's assessment of score 1.0 aligns with this finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review is abstract-only; no explicit free parameters, axioms, or invented entities are detailed in the provided text.

axioms (1)

domain assumption: Standard assumptions of on-policy distillation and LLM fine-tuning hold for the proposed reflective mechanism.
The method extends OPSD without challenging base training assumptions.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

UCOB: Learning to Utilize and Evolve Agentic Skills via Credit-Aware On-Policy Bidirectional Self-Distillation
cs.AI 2026-06 unverdicted novelty 5.0

UCOB improves agentic RL by using return-to-go comparisons between skill-conditioned and no-skill prompts as local teachers for bidirectional self-distillation and skill memory updates.

Reference graph

Works this paper leans on

9 extracted references · 5 canonical work pages · cited by 1 Pith paper · 5 internal anchors

[1] Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

Self-distillation zero: Self-revision turns binary rewards into dense supervision. arXiv preprint arXiv:2604.12002. Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Id...

2015

[2] Understanding R1-Zero-Like Training: A Critical Perspective

Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783. Kevin Lu and Thinking Machines Lab. 2025. On-policy distillation. Thinking Machines Lab: Connectionism. https://thinkingmachines.ai/blog/on-policy-distillation. Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Botian Shi, Wenha...

2025

[3] Mm-eureka: Exploring visual aha moment with rule-based large-scale reinforcement learning. CoRR. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and 1 others. 2022. Training language models to follow instructions with human feedback. Advances in neural informat...

2022

[4] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, and 1 others. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open langua...

2024

[5] Kimi k1.5: Scaling Reinforcement Learning with LLMs

Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and 1 others

[6] Qwen3 Technical Report

Qwen3 technical report. arXiv preprint arXiv:2505.09388. Feng Yao, Liyuan Liu, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, and Jianfeng Gao. 2025. Your efficient rl framework secretly brings you off-policy rl training, August 2025. URL https://fengyao.notion.site/off-policy-rl. Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan...

2025

[7] We train for 10 epochs on the science question-answering tasks and 5 epochs on the ToolUse task, using a training batch size of

that train under a fixed time budget, we use an epoch budget large enough for the methods to converge in our setting. We train for 10 epochs on the science question-answering tasks and 5 epochs on the ToolUse task, using a training batch size of

[8] All algorithms are evaluated every 10 training steps

The maximum length of the reflection prompt is 8k tokens, and the maximum length of the generated reflection is 4k tokens. All algorithms are evaluated every 10 training steps. For self-distillation methods, we use Jensen–Shannon divergence with α = 0.5, a distillation top-k of 100, and a frozen self-teacher and self-reflector during each run. C Addit...

2026

[9] Given a question and four options, please select the right answer

The prompt used as input to the self-teacher is shown in Figure 12. Given a question and four options, please select the right answer. Respond in the following format: <reasoning> ... </reasoning> <answer> ... </answer> For the answer, only output the letter corresponding to the correct option (A, B, C, or D), and nothing else. Do not restate the answer t...
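
The distillation settings quoted in extracted item [8] (Jensen–Shannon divergence with α = 0.5 and a distillation top-k of 100) can be illustrated with a small sketch. It assumes the common generalized form JSD_α(p, q) = α·KL(p‖m) + (1−α)·KL(q‖m) with m = αp + (1−α)q, and it assumes top-k means keeping the k most probable tokens and renormalizing. The paper's exact formulation may differ, and both function names are hypothetical.

```python
import math
from typing import Dict

def top_k_renormalize(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most probable tokens and rescale them to sum to 1 (assumed reading of top-k)."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def generalized_jsd(p: Dict[str, float], q: Dict[str, float], alpha: float = 0.5) -> float:
    """alpha * KL(p || m) + (1 - alpha) * KL(q || m), where m = alpha*p + (1 - alpha)*q."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    total = 0.0
    for tok in set(p) | set(q):
        pi, qi = p.get(tok, 0.0), q.get(tok, 0.0)
        mi = alpha * pi + (1 - alpha) * qi  # positive whenever pi or qi is positive
        if pi > 0:
            total += alpha * pi * math.log(pi / mi)
        if qi > 0:
            total += (1 - alpha) * qi * math.log(qi / mi)
    return total

teacher = top_k_renormalize({"a": 0.5, "b": 0.3, "c": 0.2}, k=2)
student = {"a": 0.625, "b": 0.375}
divergence = generalized_jsd(teacher, student, alpha=0.5)
```

With α = 0.5 the measure is symmetric and bounded by log 2, which is reached when the two distributions have disjoint support.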
