
arxiv: 2606.11627 · v1 · pith:5TWTH5PLnew · submitted 2026-06-10 · 💻 cs.LG · cs.AI

When Context Returns: Toward Robust Internalization in On-Policy Distillation

Pith reviewed 2026-06-27 10:14 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords on-policy distillation · context internalization · context-induced degradation · context removability · consistency regularizer · stop-gradient · forward KL divergence

The pith

A stop-gradient consistency regularizer makes distilled models stable when privileged context returns at inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

On-policy distillation internalizes privileged context such as system prompts so the student no longer needs it at test time, yet reintroducing that context often degrades performance on tasks the student already solves correctly without it. The paper identifies this context-induced degradation and defines the desired property of context removability, where the model behaves identically with or without the context. The proposed fix adds one extra forward pass that anchors the no-context output via stop-gradient and penalizes any deviation in the context-conditioned output through forward KL divergence. Across twelve configurations the regularizer improves context-conditioned accuracy in most cases, cuts context-induced harm in eleven of twelve, and removes response-length inflation. A case study shows the effect appears at the representation level, with hidden states staying nearly identical regardless of context presence.

Core claim

Standard on-policy distillation succeeds at internalizing context yet produces context-induced degradation upon reintroduction. The remedy is a lightweight consistency regularizer that first fixes the no-context output with stop-gradient and then applies forward KL to force the context-conditioned output to match it, yielding context removability. This change requires only one additional forward pass, improves no-context performance in many settings, and produces hidden states that remain nearly identical whether or not the context is supplied.

What carries the argument

Consistency regularizer that anchors the no-context output via stop-gradient and penalizes deviation of the context-conditioned output via forward KL divergence.
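A minimal numeric sketch of this regularizer for a single next-token distribution, in pure Python. It assumes the conventional reading of forward KL, with the anchored no-context distribution as the first argument, KL(stopgrad(p_no_ctx) || p_ctx); the paper's exact notation may differ. The function names and the toy logits are hypothetical. Stop-gradient is modeled by returning a zero gradient for the no-context logits, and the gradient for the context-conditioned logits uses the standard softmax identity (q - p).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_kl(anchor_probs, ctx_probs):
    """KL(anchor || ctx) for two categorical distributions."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(anchor_probs, ctx_probs)
        if p > 0.0
    )


def consistency_loss_and_grads(no_ctx_logits, ctx_logits):
    """Return (loss, grad wrt ctx_logits, grad wrt no_ctx_logits).

    Assumption: the no-context distribution is a fixed target
    (stop-gradient), so only the context-conditioned logits are updated.
    """
    anchor = softmax(no_ctx_logits)  # treated as a constant
    ctx = softmax(ctx_logits)
    loss = forward_kl(anchor, ctx)
    # d KL(p || softmax(z)) / d z_j = q_j - p_j
    grad_ctx = [q - p for p, q in zip(anchor, ctx)]
    # Stop-gradient: nothing flows back into the no-context branch.
    grad_no_ctx = [0.0] * len(no_ctx_logits)
    return loss, grad_ctx, grad_no_ctx


if __name__ == "__main__":
    loss, g_ctx, g_anchor = consistency_loss_and_grads(
        [2.0, 0.5, -1.0], [0.0, 1.0, 0.0]
    )
    print(loss, g_ctx, g_anchor)
```

In an autodiff framework the same effect would typically come from detaching the no-context output before computing the divergence, which is why only one extra forward pass is needed.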

If this is right

  • Context-conditioned accuracy improves in the majority of the twelve tested configurations.
  • Context-induced harm is reduced in eleven out of twelve settings.
  • Response-length inflation is effectively eliminated.
  • Hidden states remain nearly identical whether or not the privileged context is present.
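The quantities in these bullets could be measured along the following lines. This is a sketch under assumed definitions, not the paper's evaluation code: context-induced harm is taken here to be the fraction of instances solved without context that fail once context is reintroduced, and length inflation is taken to be the ratio of mean response lengths. Both definitions and all function names are hypothetical.

```python
def context_induced_harm(correct_no_ctx, correct_with_ctx):
    """Fraction of instances solved without context that fail with context.

    Both arguments are equal-length lists of booleans, one per instance.
    """
    solved = [w for n, w in zip(correct_no_ctx, correct_with_ctx) if n]
    if not solved:
        return 0.0
    return sum(1 for w in solved if not w) / len(solved)


def length_inflation(lengths_no_ctx, lengths_with_ctx):
    """Ratio of mean response length with context to mean length without."""
    mean_no_ctx = sum(lengths_no_ctx) / len(lengths_no_ctx)
    mean_with_ctx = sum(lengths_with_ctx) / len(lengths_with_ctx)
    return mean_with_ctx / mean_no_ctx


if __name__ == "__main__":
    # Toy data: four instances, three solved without context.
    no_ctx = [True, True, False, True]
    with_ctx = [True, False, True, False]
    print(context_induced_harm(no_ctx, with_ctx))
    print(length_inflation([100, 200], [150, 300]))
```

Under these definitions a harm of 0 and an inflation ratio near 1 would correspond to the context removability the paper aims for.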

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same anchoring idea could be tested on other distillation objectives or on supervised fine-tuning to enforce invariance to auxiliary inputs.
  • Representation-level invariance may offer a general route to robustness against prompt variations beyond the specific distillation setting.
  • If the regularizer preserves performance on larger models, it could reduce reliance on teacher modifications in production distillation pipelines.

Load-bearing premise

Anchoring the context-conditioned output to the no-context output via stop-gradient and forward KL is the right mechanism for achieving robust internalization without new performance trade-offs.

What would settle it

A new model family or domain in which the regularizer is applied yet hidden states still diverge substantially or accuracy still drops when context is reintroduced would falsify the central claim.
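One way such a check could be run is to compare per-layer hidden states for the same input with and without the context. The sketch below uses cosine similarity as the comparison, which is an assumption; the paper's case study may use a different measure. The function names, the 0.9 threshold, and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def layerwise_similarity(hidden_no_ctx, hidden_with_ctx):
    """Per-layer similarity; each argument is a list of per-layer vectors."""
    return [
        cosine_similarity(a, b)
        for a, b in zip(hidden_no_ctx, hidden_with_ctx)
    ]


def diverged_layers(similarities, threshold=0.9):
    """Indices of layers whose similarity falls below the threshold."""
    return [i for i, s in enumerate(similarities) if s < threshold]


if __name__ == "__main__":
    no_ctx = [[1.0, 0.0], [1.0, 1.0]]
    with_ctx = [[1.0, 0.0], [1.0, -1.0]]
    sims = layerwise_similarity(no_ctx, with_ctx)
    print(sims, diverged_layers(sims))
```

A non-empty list of diverged layers on a regularized model, together with an accuracy drop when context is reintroduced, would be the kind of evidence described above.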

read the original abstract

Recent work has shown that on-policy distillation can internalize privileged context, such as system prompts or task hints, into a student model so that the context is no longer needed at inference time. Although this approach successfully improves the student's no-context performance, we identify an interesting and previously unstudied phenomenon: in many settings, reintroducing the original privileged context to the distilled student actually degrades its performance, even on instances it already solves correctly without context. We term this context-induced degradation and argue that robust internalization demands not only matching the teacher's context-conditioned behavior, but also remaining stable when the context is reintroduced, a property we call context removability. Motivated by this observation, we propose a lightweight consistency regularizer that first anchors the student's no-context output via stop-gradient, then penalizes the context-conditioned output for deviating from it via forward KL divergence. This simple addition requires only one extra forward pass per training step, yet it effectively mitigates context-induced degradation and, in many cases, even improves no-context performance. Across 12 configurations spanning diverse domains and model families, our method improves context-conditioned accuracy in the majority of settings, reduces context-induced harm in 11 out of 12 settings, and effectively eliminates response-length inflation. A mechanistic case study further confirms that context removability is achieved at the representation level, with hidden states remaining nearly identical regardless of whether the context is present.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper identifies context-induced degradation in on-policy distillation, where reintroducing privileged context (e.g., system prompts) harms student performance even on instances solved correctly without it. It proposes a lightweight consistency regularizer that anchors the student's no-context output via stop-gradient and applies forward KL divergence to penalize deviations in the context-conditioned output, aiming for context removability. Across 12 configurations, the method is reported to improve context-conditioned accuracy in most settings, reduce context-induced harm in 11/12 cases, eliminate response-length inflation, and achieve similar hidden states with/without context in a mechanistic study.

Significance. If the results hold under scrutiny, the work usefully highlights a robustness failure mode in context internalization via distillation and offers a simple, low-cost regularizer that requires only one extra forward pass without teacher modifications. The mechanistic case study linking removability to representation-level invariance is a positive contribution that strengthens the empirical claims.

major comments (2)
  1. [Method (regularizer definition)] The regularizer anchors the no-context output via stop-gradient and then applies forward KL, i.e. KL(stopgrad(no_context_output) || context_output), so it treats the student's no-context distribution as the immutable target. This choice is load-bearing for the central claim of robust internalization without new trade-offs, yet the manuscript provides no ablation comparing it to alternatives such as an exponential moving average of no-context outputs or the teacher's no-context distribution, despite the concern that no-context outputs can be noisy or suboptimal early in on-policy training.
  2. [Experiments] The abstract claims aggregate improvements across 12 settings (context-conditioned accuracy gains in the majority, harm reduction in 11/12, elimination of length inflation), but the experimental section must supply per-configuration tables with baselines, variance across runs, statistical tests, and controls for the on-policy training dynamics to make the central empirical claim verifiable. The current aggregate reporting leaves the strength of evidence unclear.
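The exponential-moving-average alternative raised in the first major comment could look roughly like the sketch below. It is not from the paper: it assumes the anchor is a running average of the student's no-context output distribution at a fixed prompt position, and the function name and the decay value are hypothetical.

```python
def ema_update(ema_probs, new_probs, decay=0.99):
    """Blend a running anchor distribution with a fresh no-context output.

    A convex combination of two probability vectors is itself a probability
    vector; the final renormalization only guards against float drift.
    """
    if not 0.0 <= decay < 1.0:
        raise ValueError("decay must be in [0, 1)")
    mixed = [decay * e + (1.0 - decay) * n for e, n in zip(ema_probs, new_probs)]
    total = sum(mixed)
    return [m / total for m in mixed]


if __name__ == "__main__":
    anchor = [1.0, 0.0]
    # A noisy early-training output moves the anchor only slightly.
    anchor = ema_update(anchor, [0.0, 1.0], decay=0.9)
    print(anchor)
```

Compared with the plain stop-gradient anchor, a slowly moving target of this kind trades an extra hyperparameter (the decay) for less sensitivity to any single noisy no-context output.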
minor comments (1)
  1. [Method] Clarify the exact loss weighting between the distillation objective and the new regularizer, including any hyperparameter sensitivity analysis.
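For concreteness, the weighting the minor comment asks about could be written as a single coefficient on the regularizer. The notation here is an assumption introduced for illustration and is not taken from the paper: $\lambda$ is the weight, $\mathrm{sg}[\cdot]$ is stop-gradient, $\pi_\theta$ is the student, $x$ is the input, and $c$ is the privileged context.

```latex
\mathcal{L}_{\text{total}}(\theta)
  = \mathcal{L}_{\text{OPD}}(\theta)
  + \lambda \, \mathbb{E}_{x,\,c}\!\left[
      \mathrm{KL}\!\left(
        \mathrm{sg}\!\left[\pi_\theta(\cdot \mid x)\right]
        \,\middle\|\,
        \pi_\theta(\cdot \mid c, x)
      \right)
    \right]
```

Under this form, a sensitivity analysis would report no-context accuracy and context-induced harm as $\lambda$ varies, with $\lambda = 0$ recovering standard on-policy distillation.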

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, agreeing where revisions are warranted to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Method (regularizer definition)] The regularizer anchors the no-context output via stop-gradient and then applies forward KL, i.e. KL(stopgrad(no_context_output) || context_output), so it treats the student's no-context distribution as the immutable target. This choice is load-bearing for the central claim of robust internalization without new trade-offs, yet the manuscript provides no ablation comparing it to alternatives such as an exponential moving average of no-context outputs or the teacher's no-context distribution, despite the concern that no-context outputs can be noisy or suboptimal early in on-policy training.

    Authors: We selected the stop-gradient formulation for its minimal overhead and direct enforcement of context removability without extra hyperparameters or teacher access. However, we acknowledge that an explicit comparison to EMA or teacher no-context targets would address potential concerns about early-training noise. In the revised version we will add a targeted ablation across 3-4 representative configurations comparing these alternatives, reporting their impact on both no-context accuracy and context-induced degradation. revision: yes

  2. Referee: [Experiments] The abstract claims aggregate improvements across 12 settings (context-conditioned accuracy gains in the majority, harm reduction in 11/12, elimination of length inflation), but the experimental section must supply per-configuration tables with baselines, variance across runs, statistical tests, and controls for the on-policy training dynamics to make the central empirical claim verifiable. The current aggregate reporting leaves the strength of evidence unclear.

    Authors: We agree that aggregate reporting alone limits verifiability. The revised manuscript will include expanded experimental tables showing per-configuration results (accuracy, harm reduction, length) together with means and standard deviations from multiple random seeds where runs were repeated, plus a brief discussion of training-dynamic controls (e.g., identical optimizer schedules and data ordering). We will retain the aggregate summary for readability while making the raw per-setting data transparent. revision: yes

Circularity Check

0 steps flagged

No significant circularity; regularizer introduced as independent design choice.

full rationale

The paper motivates and adds a stop-gradient + forward KL consistency regularizer to address observed context-induced degradation, but this is presented as an empirical engineering addition rather than a derived result. No equations reduce claimed performance gains to quantities defined by prior fitted parameters, self-citations, or ansatzes; the method requires only one extra forward pass and is validated across 12 configurations without self-referential reductions. The central claims rest on experimental outcomes, not on any load-bearing self-definition or imported uniqueness theorem.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard distillation assumptions and the empirical effectiveness of the new regularizer; no free parameters, invented entities, or non-standard axioms are introduced in the abstract.

axioms (1)
  • domain assumption: The no-context student output serves as a stable and desirable target for the context-conditioned output.
    The regularizer explicitly uses stop-gradient on the no-context output to define the penalty applied to the context-conditioned output.

pith-pipeline@v0.9.1-grok · 5791 in / 1115 out tokens · 29293 ms · 2026-06-27T10:14:42.862513+00:00 · methodology

