
arxiv: 2606.30923 · v1 · pith:5BFJPPYRnew · submitted 2026-06-29 · 💻 cs.LG · cs.AI · stat.ML

Behavior Cloning is Not All You Need: The Optimality of On-Policy Distillation for Noisy Expert Feedback

Pith reviewed 2026-07-01 06:19 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI · stat.ML
keywords: imitation learning, noisy experts, on-policy distillation, behavior cloning, sample complexity, language model training, reinforcement learning

The pith

Offline imitation learning from noisy experts requires exponential sample complexity in the horizon to match a clean expert.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models imitation learning where the learner sees only noisy versions of an expert policy but must compete with the reward of the clean expert, as occurs when training language models on imperfect chain-of-thought data. It proves that any offline method, such as behavior cloning, needs sample complexity that grows exponentially with horizon length. In contrast, a novel variant of on-policy distillation that interacts online with the noisy expert achieves only polynomial dependence on the horizon. Under an additional natural condition on the noise distribution, horizon-free guarantees become possible, though at the price of worse scaling with the size of the policy class. The results also cover the case of unknown corruption when the clean expert is deterministic.

Core claim

Offline learning from noisy trajectories is fundamentally hard: to compete with the clean expert, the sample complexity must grow exponentially in the horizon, in contrast to the clean-expert setting, where there is no explicit horizon dependence. Online interaction with the noisy expert via a novel variant of on-policy distillation (OPD), by comparison, achieves polynomial dependence on the horizon in general. Under a natural condition on the expert noise distribution, which is shown to be necessary for any horizon-free sample complexity, a horizon-free guarantee is attainable, although the algorithm sacrifices statistical efficiency in its dependence on the size of the policy class.
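One informal way to see why horizon length matters offline is to ask how often a noisy trajectory happens to be entirely clean. The sketch below is an intuition aid only, not the paper's lower-bound construction: it assumes each step is corrupted independently with probability `eta`, which is a simplification introduced here and not the paper's noise model. The function names are hypothetical. The values of `eta` and the horizon are borrowed from the modular-addition ablations described in the figure captions.

```python
def clean_trajectory_prob(eta: float, horizon: int) -> float:
    """Chance that every one of `horizon` steps is uncorrupted.

    Assumption (not from the paper): each step is corrupted independently
    with probability `eta`.
    """
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    if horizon < 0:
        raise ValueError("horizon must be non-negative")
    return (1.0 - eta) ** horizon


def expected_draws_per_clean_trajectory(eta: float, horizon: int) -> float:
    """Mean number of offline trajectories sampled per fully clean one."""
    p = clean_trajectory_prob(eta, horizon)
    return float("inf") if p == 0.0 else 1.0 / p


if __name__ == "__main__":
    # Horizon values taken from the CoT-length ablation; eta = 0.2 is the
    # high-noise setting mentioned in the figure captions.
    for horizon in (3, 9, 15, 23, 31):
        print(horizon, expected_draws_per_clean_trajectory(0.2, horizon))
```

Under this toy assumption the number of trajectories needed per clean sample grows geometrically with the horizon, which matches the direction of the offline hardness claim but says nothing about its exact rate.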

What carries the argument

The noisy expert model (learner observes noisy policy but targets clean-expert reward) together with the novel online variant of on-policy distillation that queries the noisy expert directly.

If this is right

  • Offline methods incur exponential horizon dependence when the expert is noisy.
  • The proposed online OPD variant achieves only polynomial horizon dependence in general.
  • Under the natural noise condition, horizon-free sample complexity is achievable, albeit with worse dependence on policy class size.
  • The derived loss function supplies an alternative training objective for language models.
  • The separation extends to the setting of unknown corruption when the clean expert is deterministic.
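The offline/online distinction in the bullets above comes down to whose states get labeled. The following minimal sketch contrasts the two data-collection loops in a toy environment. It illustrates generic on-policy distillation, not the paper's novel variant, and every name in it (`make_noisy_expert`, `step`, the integer state space, the "advance or stall" actions) is a hypothetical stand-in chosen for illustration.

```python
import random
from typing import Callable, List, Tuple

State = int
Action = int
Policy = Callable[[State, random.Random], Action]


def make_noisy_expert(eta: float) -> Policy:
    """Toy expert: the clean action is 1; with probability eta it emits 0."""
    def policy(state: State, rng: random.Random) -> Action:
        return 0 if rng.random() < eta else 1
    return policy


def step(state: State, action: Action) -> State:
    """Toy deterministic dynamics (an assumption for illustration)."""
    return state + action


def offline_bc_data(expert: Policy, horizon: int,
                    rng: random.Random) -> List[Tuple[State, Action]]:
    """Behavior cloning data: states visited by the noisy expert itself."""
    data, state = [], 0
    for _ in range(horizon):
        action = expert(state, rng)
        data.append((state, action))
        state = step(state, action)  # the expert drives the trajectory
    return data


def on_policy_data(student: Policy, expert: Policy, horizon: int,
                   rng: random.Random) -> List[Tuple[State, Action]]:
    """On-policy data: the expert labels states reached by the student."""
    data, state = [], 0
    for _ in range(horizon):
        data.append((state, expert(state, rng)))  # query expert here
        state = step(state, student(state, rng))  # the student drives
    return data
```

In the offline loop a corrupted expert action changes every later state in the trajectory, whereas in the on-policy loop the expert's noise affects only the label at the queried state.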

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The noise condition may be testable on real teacher data used for chain-of-thought training.
  • Other online interaction schemes could inherit similar polynomial guarantees.
  • The framework suggests preferring online distillation over pure offline fine-tuning whenever expert noise is present and horizon length is large.

Load-bearing premise

A natural condition on the expert noise distribution is necessary for any horizon-free sample complexity.

What would settle it

An explicit noise distribution violating the stated natural condition for which some online algorithm still achieves horizon-free sample complexity, or an offline algorithm achieving only polynomial horizon dependence under the same noise model.

Figures

Figures reproduced from arXiv: 2606.30923 by Adam Block, Daniel Hsu, Peihan Liu, Ved Sriraman.

Figure 1: Comparison of offline BC, standard OPD from […
Figure 2: Full modular-addition results over 3M expert trajectories. Top: validation loss; Bottom: accuracy. Curves show the mean over three random seeds, with shaded regions indicating one standard deviation. In the low-noise setting (η = 0), the NAIL variants drive validation loss to zero and reach perfect accuracy, while OPD-F plateaus. In the high-noise setting (η = 0.2), the separation is more pronounced: NAIL…
Figure 3: Ablation of student rollout temperature for …
Figure 4: Ablation of CoT length (m) and corruption rate (η) on Modular Addition. Curves show the mean over three random seeds, with shaded regions indicating one standard deviation. A.3.2 Ablating Horizon and Noise Level for Modular Addition. We further ablate the dependence of the modular-addition results on the chain-of-thought length and corruption rate. We consider m ∈ {3, 9, 15, 23, 31}, η ∈ {0, 0.05, 0.1, 0.15…
Figure 5: Interpolation between NAIL-F and NAIL-R on Modular Addition. The parameter β interpolates between the forward-KL (β = 0) and the reverse-KL (β = 1) losses. Curves show the mean over three random seeds, with shaded regions indicating one standard deviation. All interpolated variants eventually solve the task in both noise regimes, but the learning speed depends strongly on β. Left: in the low-noise setting,…
Figure 6: Interpolation between NAIL-F and NAIL-R on GSM-8K. The parameter β interpolates between the forward-KL (β = 0) and the reverse-KL (β = 1) losses. Curves show the mean over three random seeds, with shaded regions indicating one standard deviation. Left: in the low-noise setting (t = 1), forward-KL-heavy objectives learn fastest and reach the highest accuracy. Right: in the high-noise setting (t = 4), all mi…
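The β interpolation described in the captions for Figures 5 and 6 can be sketched for a single next-token distribution. This is a hedged illustration only: the captions do not state how the two losses are combined, so the linear mixture below is an assumption, as is the usual distillation convention that forward KL means KL(teacher ‖ student) and reverse KL means KL(student ‖ teacher). The function names are hypothetical, and the NAIL-F and NAIL-R objectives themselves may differ from plain KL terms.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for discrete distributions; eps guards against q_i = 0."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def interpolated_kl(teacher: Sequence[float], student: Sequence[float],
                    beta: float) -> float:
    """Assumed linear mix: beta = 0 is forward KL, beta = 1 is reverse KL."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    forward = kl(teacher, student)  # KL(teacher || student)
    reverse = kl(student, teacher)  # KL(student || teacher)
    return (1.0 - beta) * forward + beta * reverse


if __name__ == "__main__":
    teacher = [0.9, 0.1]
    student = [0.5, 0.5]
    for beta in (0.0, 0.5, 1.0):
        print(beta, interpolated_kl(teacher, student, beta))
```

In a training loop this quantity would be averaged over token positions, with the student distribution carrying gradients. Which sequences those positions come from (teacher data or student rollouts) is a separate choice from the value of β.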
Original abstract

Imitation Learning is a natural framework for learning in sequential decision-making systems and has emerged as the dominant paradigm through which we understand language model training. A central puzzle is that, while in theory offline IL can be horizon-free and optimal, in practice online methods such as on-policy distillation often outperform offline methods such as supervised fine-tuning. We propose a noisy expert model to explain this gap, in which the learner only has access to a noisy version of the expert's policy, but wishes to compete against the reward achieved by a clean expert, motivated by the fact that in many applications, e.g. training language models to perform long chains of thought, the expert is often imperfect. In this setting, we show a sharp separation between offline and online IL. Offline learning from noisy trajectories is fundamentally hard: to compete with the clean expert, the sample complexity must grow exponentially, in contradistinction to the clean expert setting where no explicit horizon dependence exists. In contrast, we prove that online interaction with the noisy expert via a novel variant of OPD enables polynomial dependence on the horizon in general. We further show that, under a natural condition on the expert noise distribution, which we show to be necessary for any horizon-free sample complexity, one can obtain such a guarantee, although our proposed algorithm sacrifices statistical efficiency in its dependence on the size of the policy class. Our analysis leads to an alternative loss function that is commonly considered empirically for LM training. We further provide algorithms and lower bounds, and extend our results to the more realistic setting of unknown corruption when the clean expert is deterministic, thereby providing a theoretical foundation for why OPD can outperform SFT when training language models from imperfect teachers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces a noisy expert model for imitation learning, motivated by imperfect experts in applications such as language model training on long reasoning chains. In this model, the learner accesses only noisy versions of the expert policy but seeks to match the reward of the clean expert. The paper establishes a sharp separation: offline imitation learning from noisy trajectories requires sample complexity exponential in the horizon to compete with the clean expert (in contrast to the horizon-free results for clean experts), while a novel variant of on-policy distillation (OPD) achieves polynomial horizon dependence in general. Under a natural condition on the expert noise distribution, which the paper proves is necessary for any horizon-free sample complexity, the online method attains a horizon-free guarantee (though with worse dependence on policy class size). The analysis derives an alternative loss function relevant to LM training and extends the results to unknown corruption when the clean expert is deterministic, providing algorithms and lower bounds throughout.

Significance. If the derivations hold, this work supplies a theoretical explanation for why online methods such as on-policy distillation empirically outperform offline supervised fine-tuning when experts are noisy or imperfect. The offline-online separation, the necessity result for the noise condition, and the suggested alternative loss function constitute substantive contributions to imitation learning theory. The extension to unknown corruption strengthens applicability. The manuscript ships explicit algorithms, lower bounds, and a falsifiable modeling assumption, which are strengths.

minor comments (3)
  1. [Abstract / §1] The abstract and introduction would benefit from a brief explicit statement of the precise assumptions on the MDP (e.g., finite horizon, deterministic vs. stochastic transitions) that underpin both the exponential lower bound and the polynomial upper bound.
  2. [§3 / §4] Notation for the noisy expert distribution and the clean expert policy should be unified across the lower-bound construction and the online algorithm analysis to avoid any potential reader confusion.
  3. [Figures] Figure captions for any sample-complexity plots should explicitly state whether the plotted curves correspond to the general polynomial bound or the horizon-free regime under the noise condition.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review, accurate summary of our contributions on the noisy expert model, the offline-online separation, and the necessity result for the noise condition, as well as the recommendation for minor revision. The referee's assessment aligns well with the manuscript's goals of providing theoretical foundations for why on-policy distillation can outperform offline methods under imperfect experts.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper proposes a noisy expert model and derives theoretical sample complexity bounds, separations between offline and online imitation learning, and a necessity result for a noise condition directly from the model assumptions and standard analysis techniques. No steps reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations; the derivations remain self-contained against the stated model without renaming known results or smuggling ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the newly introduced noisy expert model and a domain assumption about the noise distribution; no free parameters are mentioned.

axioms (1)
  • domain assumption: a natural condition on the expert noise distribution, which is necessary for any horizon-free sample complexity
    Invoked to obtain the horizon-free sample complexity guarantee for the online method; the polynomial guarantee holds in general without it.
invented entities (1)
  • noisy expert model (no independent evidence)
    purpose: Models access to a noisy version of the expert policy while the learner competes against the reward of a clean expert
    Introduced to explain the observed gap between theory and practice in imitation learning for language models.

pith-pipeline@v0.9.1-grok · 5855 in / 1305 out tokens · 31658 ms · 2026-07-01T06:19:37.265710+00:00 · methodology

