Behavior Cloning is Not All You Need: The Optimality of On-Policy Distillation for Noisy Expert Feedback

Adam Block; Daniel Hsu; Peihan Liu; Ved Sriraman

arXiv: 2606.30923 · v1 · pith:5BFJPPYRnew · submitted 2026-06-29 · 💻 cs.LG · cs.AI · stat.ML


Pith reviewed 2026-07-01 06:19 UTC · model grok-4.3


keywords: imitation learning, noisy experts, on-policy distillation, behavior cloning, sample complexity, language model training, reinforcement learning


The pith

Offline imitation learning from noisy experts requires exponential sample complexity in the horizon to match a clean expert.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models imitation learning where the learner sees only noisy versions of an expert policy but must compete with the reward of the clean expert, as occurs when training language models on imperfect chain-of-thought data. It proves that any offline method, such as behavior cloning, needs sample complexity that grows exponentially with horizon length. In contrast, a novel variant of on-policy distillation that interacts online with the noisy expert achieves only polynomial dependence on the horizon. Under an additional natural condition on the noise distribution, horizon-free guarantees become possible, though at the price of worse scaling with the size of the policy class. The results also cover the case of unknown corruption when the clean expert is deterministic.
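One way to write the setup described above is sketched below. The notation ($\pi^{\star}$ for the clean expert, $\tilde{\pi}$ for the noisy expert, $\hat{\pi}$ for the learned policy, $J$ for expected return, $H$ for the horizon) is an assumption made here for illustration and is not taken from the paper.

```latex
% Illustrative notation only; the paper's own symbols may differ.
\begin{align*}
J(\pi) &= \mathbb{E}_{\pi}\Big[\sum_{h=1}^{H} r_h(s_h, a_h)\Big]
  && \text{expected return over horizon } H,\\
\text{observed:}\quad & a_h \sim \tilde{\pi}(\cdot \mid s_h)
  && \text{actions come from the noisy expert } \tilde{\pi},\\
\text{goal:}\quad & J(\pi^{\star}) - J(\hat{\pi}) \le \varepsilon
  && \text{compete with the clean expert } \pi^{\star}.
\end{align*}
```

In the offline setting the states $s_h$ are those visited by $\tilde{\pi}$ itself; in the online setting the learner may query $\tilde{\pi}$ at states reached by its own rollouts.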

Core claim

Offline learning from noisy trajectories is fundamentally hard: to compete with the clean expert, the sample complexity must grow exponentially in the horizon, in contrast to the clean-expert setting, where no explicit horizon dependence exists. Online interaction with the noisy expert via a novel variant of OPD, on the other hand, enables polynomial dependence on the horizon in general. Under a natural condition on the expert noise distribution, which is necessary for any horizon-free sample complexity, one can obtain such a horizon-free guarantee, although the algorithm sacrifices statistical efficiency in its dependence on the size of the policy class.
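For intuition only, the following sketch shows how per-step noise compounds over a trajectory. It assumes each step is corrupted independently with probability `eta`, which is a simplification introduced here and not the paper's lower-bound construction. The function names are hypothetical; the horizon values and `eta = 0.2` are borrowed from the figure captions.

```python
def prob_clean_trajectory(eta: float, horizon: int) -> float:
    """Probability that all `horizon` steps are uncorrupted, assuming each
    step is corrupted independently with probability `eta` (an assumption)."""
    return (1.0 - eta) ** horizon


def expected_draws_per_clean_trajectory(eta: float, horizon: int) -> float:
    """Expected number of noisy trajectories sampled until one is fully
    clean (the mean of a geometric distribution)."""
    return 1.0 / prob_clean_trajectory(eta, horizon)


if __name__ == "__main__":
    eta = 0.2
    for horizon in (3, 9, 15, 23, 31):
        p = prob_clean_trajectory(eta, horizon)
        n = expected_draws_per_clean_trajectory(eta, horizon)
        print(f"H={horizon:2d}  P(clean)={p:.6f}  expected draws={n:.1f}")
```

The fraction of fully clean trajectories shrinks geometrically with the horizon. A learner does not literally need fully clean trajectories, so this is a heuristic picture of why long horizons hurt offline data, not a proof.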

What carries the argument

The noisy expert model (learner observes noisy policy but targets clean-expert reward) together with the novel online variant of on-policy distillation that queries the noisy expert directly.

If this is right

Offline methods incur exponential horizon dependence when the expert is noisy.
The proposed online OPD variant achieves only polynomial horizon dependence in general.
Under the natural noise condition, horizon-free sample complexity is achievable, albeit with worse dependence on policy class size.
The derived loss function supplies an alternative training objective for language models.
The separation extends to the setting of unknown corruption when the clean expert is deterministic.
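The difference between offline and on-policy data collection can be sketched on a toy tabular problem. Everything below is hypothetical: the environment, the deterministic clean expert, the uniform-wrong-action noise, and the majority-vote student are illustrative choices, and the loop is a generic DAgger-style procedure rather than the paper's novel OPD variant. The toy is not meant to reproduce the exponential-versus-polynomial separation.

```python
import random
from collections import Counter, defaultdict

N_STATES, N_ACTIONS = 20, 3  # hypothetical toy sizes


def clean_expert(state):
    """Hypothetical deterministic clean expert."""
    return state % N_ACTIONS


def noisy_expert(state, eta, rng):
    """With probability eta, return a uniformly random wrong action."""
    correct = clean_expert(state)
    if rng.random() < eta:
        return rng.choice([a for a in range(N_ACTIONS) if a != correct])
    return correct


def step(state, action):
    """Toy deterministic dynamics."""
    return (state * N_ACTIONS + action + 1) % N_STATES


def student_action(labels, state):
    """Majority vote over the noisy labels seen at this state (default 0)."""
    votes = labels[state]
    return votes.most_common(1)[0][0] if votes else 0


def collect_offline(labels, eta, horizon, rng):
    """Offline data: the noisy expert both labels and drives the rollout."""
    state = 0
    for _ in range(horizon):
        action = noisy_expert(state, eta, rng)
        labels[state][action] += 1
        state = step(state, action)


def collect_on_policy(labels, eta, horizon, rng):
    """On-policy data: the student drives; the noisy expert only labels."""
    state = 0
    for _ in range(horizon):
        labels[state][noisy_expert(state, eta, rng)] += 1
        state = step(state, student_action(labels, state))


def train(collector, eta=0.2, horizon=10, rounds=200, seed=0):
    rng = random.Random(seed)
    labels = defaultdict(Counter)
    for _ in range(rounds):
        collector(labels, eta, horizon, rng)
    return labels


def student_matches_clean_expert(labels, horizon=10):
    """Check the student against the clean expert along the student's rollout."""
    state = 0
    for _ in range(horizon):
        action = student_action(labels, state)
        if action != clean_expert(state):
            return False
        state = step(state, action)
    return True
```

Calling `train(collect_on_policy)` concentrates every noisy label on states the student actually visits, whereas `train(collect_offline)` spends labels on states reached through the expert's own mistakes.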

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The noise condition may be testable on real teacher data used for chain-of-thought training.
Other online interaction schemes could inherit similar polynomial guarantees.
The framework suggests preferring online distillation over pure offline fine-tuning whenever expert noise is present and horizon length is large.

Load-bearing premise

A natural condition on the expert noise distribution is necessary for any horizon-free sample complexity.

What would settle it

An explicit noise distribution violating the stated natural condition for which some online algorithm still achieves horizon-free sample complexity, or an offline algorithm achieving only polynomial horizon dependence under the same noise model.

Figures

Figures reproduced from arXiv: 2606.30923 by Adam Block, Daniel Hsu, Peihan Liu, Ved Sriraman.

**Figure 2.** Full modular-addition results over 3M expert trajectories. Top: validation loss; Bottom: accuracy. Curves show the mean over three random seeds, with shaded regions indicating one standard deviation. In the low-noise setting (η = 0), the NAIL variants drive validation loss to zero and reach perfect accuracy, while OPD-F plateaus. In the high-noise setting (η = 0.2), the separation is more pronounced: NAIL…

**Figure 3.** Ablation of student rollout temperature for…

**Figure 4.** Ablation of CoT length (m) and corruption rate (η) on Modular Addition. Curves show the mean over three random seeds, with shaded regions indicating one standard deviation. Accompanying text from Section A.3.2 (Ablating Horizon and Noise Level for Modular Addition): We further ablate the dependence of the modular-addition results on the chain-of-thought length and corruption rate. We consider m ∈ {3, 9, 15, 23, 31}, η ∈ {0, 0.05, 0.1, 0.15…

**Figure 5.** Interpolation between NAIL-F and NAIL-R on Modular Addition. The parameter β interpolates between the forward-KL (β = 0) and the reverse-KL (β = 1) losses. Curves show the mean over three random seeds, with shaded regions indicating one standard deviation. All interpolated variants eventually solve the task in both noise regimes, but the learning speed depends strongly on β. Left: in the low-noise setting,…

**Figure 6.** Interpolation between NAIL-F and NAIL-R on GSM-8K. The parameter β interpolates between the forward-KL (β = 0) and the reverse-KL (β = 1) losses. Curves show the mean over three random seeds, with shaded regions indicating one standard deviation. Left: in the low-noise setting (t = 1), forward-KL-heavy objectives learn fastest and reach the highest accuracy. Right: in the high-noise setting (t = 4), all mi…
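The captions for Figures 5 and 6 describe a parameter β that moves between forward-KL and reverse-KL losses. A minimal sketch of one such interpolation follows. It assumes a simple linear mixture and the usual distillation convention that forward KL is KL(teacher ‖ student); the captions do not state the exact form used by NAIL-F and NAIL-R, so both choices, and the function names, are assumptions.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists.
    Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def interpolated_kl(teacher, student, beta):
    """Linear mixture (an assumption): beta = 0 gives forward KL,
    beta = 1 gives reverse KL."""
    forward = kl(teacher, student)   # KL(teacher || student)
    reverse = kl(student, teacher)   # KL(student || teacher)
    return (1.0 - beta) * forward + beta * reverse


if __name__ == "__main__":
    teacher = [0.7, 0.2, 0.1]
    student = [0.5, 0.3, 0.2]
    for beta in (0.0, 0.5, 1.0):
        print(beta, interpolated_kl(teacher, student, beta))
```

Forward KL penalizes the student for missing mass where the teacher puts it, while reverse KL penalizes the student for placing mass where the teacher does not, which is one reason the two ends of the interpolation can behave differently under teacher noise.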

Original abstract

Imitation Learning is a natural framework for learning in sequential decision-making systems and has emerged as the dominant paradigm through which we understand language model training. A central puzzle is that, while in theory offline IL can be horizon-free and optimal, in practice online methods such as on-policy distillation often outperform offline methods such as supervised fine-tuning. We propose a noisy expert model to explain this gap, in which the learner only has access to a noisy version of the expert's policy, but wishes to compete against the reward achieved by a clean expert, motivated by the fact that in many applications, e.g. training language models to perform long chains of thought, the expert is often imperfect. In this setting, we show a sharp separation between offline and online IL. Offline learning from noisy trajectories is fundamentally hard: to compete with the clean expert, the sample complexity must grow exponentially, in contradistinction to the clean expert setting where no explicit horizon dependence exists. In contrast, we prove that online interaction with the noisy expert via a novel variant of OPD enables polynomial dependence on the horizon in general. We further show that, under a natural condition on the expert noise distribution, which we show to be necessary for any horizon-free sample complexity, one can obtain such a guarantee, although our proposed algorithm sacrifices statistical efficiency in its dependence on the size of the policy class. Our analysis leads to an alternative loss function that is commonly considered empirically for LM training. We further provide algorithms and lower bounds, and extend our results to the more realistic setting of unknown corruption when the clean expert is deterministic, thereby providing a theoretical foundation for why OPD can outperform SFT when training language models from imperfect teachers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows a clean separation: offline imitation from a noisy expert needs sample complexity exponential in the horizon, while a variant of on-policy distillation needs only polynomial dependence, and a condition on the noise is necessary for any horizon-free bound.


The main thing to know is that this work gives a theoretical account of why on-policy distillation can beat behavior cloning when the expert is noisy. Under their model the learner sees corrupted trajectories but wants to match the clean expert's reward. Offline methods then require a number of samples exponential in the horizon to compete, while their online variant achieves polynomial dependence in general. Under an extra condition on the noise distribution, which they prove is necessary for any horizon-free guarantee, the dependence on the horizon disappears altogether.

What stands out is the separation result itself, plus the necessity claim and the extension to unknown corruption when the expert is deterministic. They also recover an alternative loss that lines up with some loss functions already used in language model training. The analysis is explicit about the trade-off: the horizon-free algorithm pays for that guarantee with worse dependence on the size of the policy class.

The soft spots are modest. The statistical efficiency hit on the policy class size is real and could matter in large models. The noisy expert model is a simplification, so how well the noise condition holds in actual long-horizon LM tasks is left open. No circularity or mismatched assumptions between the lower and upper bounds appear in the claims.

This is for people working on imitation learning theory or on why online methods sometimes win in sequential decision making with imperfect teachers. A reader who wants sample-complexity results that explain a practical gap will find it useful. The derivations look careful enough that it deserves a serious referee.

Referee Report

0 major / 3 minor

Summary. The paper introduces a noisy expert model for imitation learning, motivated by imperfect experts in applications like language model training on long reasoning chains. In this model, the learner accesses only noisy versions of the expert policy but seeks to match the reward of the clean expert. It establishes a sharp separation: offline imitation learning from noisy trajectories requires exponential sample complexity in the horizon to compete with the clean expert (contrasting with horizon-free results for clean experts), while a novel variant of on-policy distillation (OPD) achieves polynomial horizon dependence in general. Under a natural condition on the expert noise distribution, which the paper proves is necessary for any horizon-free sample complexity, horizon-free guarantees become achievable (though with worse dependence on policy class size). The analysis derives an alternative loss function relevant to LM training and extends the results to unknown corruption when the clean expert is deterministic, providing algorithms and lower bounds throughout.

Significance. If the derivations hold, this work supplies a theoretical explanation for why online methods such as on-policy distillation empirically outperform offline supervised fine-tuning when experts are noisy or imperfect. The offline-online separation, the necessity result for the noise condition, and the suggested alternative loss function constitute substantive contributions to imitation learning theory. The extension to unknown corruption strengthens applicability. The manuscript ships explicit algorithms, lower bounds, and a falsifiable modeling assumption, which are strengths.

minor comments (3)

[Abstract / §1] The abstract and introduction would benefit from a brief explicit statement of the precise assumptions on the MDP (e.g., finite horizon, deterministic vs. stochastic transitions) that underpin both the exponential lower bound and the polynomial upper bound.
[§3 / §4] Notation for the noisy expert distribution and the clean expert policy should be unified across the lower-bound construction and the online algorithm analysis to avoid any potential reader confusion.
[Figures] Figure captions for any sample-complexity plots should explicitly state whether the plotted curves correspond to the general polynomial bound or the horizon-free regime under the noise condition.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review, accurate summary of our contributions on the noisy expert model, the offline-online separation, and the necessity result for the noise condition, as well as the recommendation for minor revision. The referee's assessment aligns well with the manuscript's goals of providing theoretical foundations for why on-policy distillation can outperform offline methods under imperfect experts.

Circularity Check

0 steps flagged

No significant circularity identified


The paper proposes a noisy expert model and derives theoretical sample complexity bounds, separations between offline and online imitation learning, and a necessity result for a noise condition directly from the model assumptions and standard analysis techniques. No steps reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations; the derivations remain self-contained against the stated model without renaming known results or smuggling ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the newly introduced noisy expert model and a domain assumption about the noise distribution; no free parameters are mentioned.

axioms (1)

domain assumption: a natural condition on the expert noise distribution, shown to be necessary for any horizon-free sample complexity
Invoked to obtain the horizon-free sample complexity guarantee for the online method.

invented entities (1)

noisy expert model (no independent evidence)
purpose: Models access to a noisy version of the expert policy while the learner competes against the reward of a clean expert
Introduced to explain the observed gap between theory and practice in imitation learning for language models.



Reference graph

Works this paper leans on

67 extracted references · 23 canonical work pages · 19 internal anchors

[1]

Phi-4 Technical Report

Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InThe twelfth international conference on learning representations, 2024

2024
[3]

Cot information: Improved sample complexity under chain-of-thought supervision

Awni Altabaa, Omar Montasser, and John Lafferty. Cot information: Improved sample complexity under chain-of-thought supervision. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

2026
[4]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

A careful examination of large behavior models for multitask dexterous manipulation.Science Robotics, 11(113):eaea6201, 2026

Jose Barreiros, Andrew Beaulieu, Aditya Bhat, Rick Cory, Eric Cousineau, Hongkai Dai, Ching-Hsin Fang, Kunimatsu Hashimoto, Muhammad Zubair Irshad, Masha Itkina, et al. A careful examination of large behavior models for multitask dexterous manipulation.Science Robotics, 11(113):eaea6201, 2026

2026
[6]

Provable guaran- tees for generative behavior cloning: Bridging low-level stability and high-level behavior.Advances in Neural Information Processing Systems, 36:48534–48547, 2023

Adam Block, Ali Jadbabaie, Daniel Pfrommer, Max Simchowitz, and Russ Tedrake. Provable guaran- tees for generative behavior cloning: Bridging low-level stability and high-level behavior.Advances in Neural Information Processing Systems, 36:48534–48547, 2023

2023
[7]

Butter- fly effects of sgd noise: Error amplification in behavior cloning and autoregression

Adam Block, Dylan J Foster, Akshay Krishnamurthy, Max Simchowitz, and Cyril Zhang. Butter- fly effects of sgd noise: Error amplification in behavior cloning and autoregression. InThe Twelfth International Conference on Learning Representations, 2024

2024
[8]

Cambridge university press, 2006

Nicolo Cesa-Bianchi and Gábor Lugosi.Prediction, learning, and games. Cambridge university press, 2006. 15

2006
[9]

Learning to generate better than your llm

Jonathan Chang, Kianté Brantley, Rajkumar Ramamurthy, Dipendra Misra, and Wen Sun. Learning to generate better than your llm. InNeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023

2023
[10]

Ash, Akshay Krishnamurthy, and Dylan J Foster

Fan Chen, Audrey Huang, Noah Golowich, Sadhika Malladi, Adam Block, Jordan T. Ash, Akshay Krishnamurthy, and Dylan J Foster. The coverage principle: How pre-training enables post-training. InThe Fourteenth International Conference on Learning Representations, 2026

2026
[11]

Deep imitation learning for autonomous driving in genericurbanscenarioswithenhancedsafety

Jianyu Chen, Bodi Yuan, and Masayoshi Tomizuka. Deep imitation learning for autonomous driving in genericurbanscenarioswithenhancedsafety. In2019 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 2884–2890. IEEE, 2019

2019
[12]

Self-play fine-tuning converts weak language models to strong language models

Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. InInternational Conference on Machine Learning, pages 6621–6642. PMLR, 2024

2024
[13]

Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

2025
[14]

Training Verifiers to Solve Math Word Problems

KarlCobbe, VineetKosaraju, MohammadBavarian, MarkChen, HeewooJun, LukaszKaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[15]

Efficient first-order contextual bandits: Prediction, allo- cation, and triangular discrimination.Advances in Neural Information Processing Systems, 34:18907– 18919, 2021

Dylan J Foster and Akshay Krishnamurthy. Efficient first-order contextual bandits: Prediction, allo- cation, and triangular discrimination.Advances in Neural Information Processing Systems, 34:18907– 18919, 2021

2021
[16]

The statistical complexity of interactive decision making.arXiv preprint arXiv:2112.13487, 2021

Dylan J Foster, Sham M Kakade, Jian Qian, and Alexander Rakhlin. The statistical complexity of interactive decision making.arXiv preprint arXiv:2112.13487, 2021

work page arXiv 2021
[17]

Is behavior cloning all you need? understanding horizon in imitation learning.Advances in Neural Information Processing Systems, 37:120602–120666, 2024

Dylan J Foster, Adam Block, and Dipendra Misra. Is behavior cloning all you need? understanding horizon in imitation learning.Advances in Neural Information Processing Systems, 37:120602–120666, 2024

2024
[18]

Online estimation via offline estima- tion: An information-theoretic framework.Advances in Neural Information Processing Systems, 37: 42840–42898, 2024

Dylan J Foster, Yanjun Han, Jian Qian, and Alexander Rakhlin. Online estimation via offline estima- tion: An information-theoretic framework.Advances in Neural Information Processing Systems, 37: 42840–42898, 2024

2024
[19]

Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Jiacai Liu, Zhuo Jiang, Yuanheng Zhu, and Dongbin Zhao. Revisiting on-policy distillation: Empirical failure modes and simple fixes.arXiv preprint arXiv:2603.25562, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[20]

Cambridge university press, 2000

Sara A Geer.Empirical Processes in M-estimation, volume 6. Cambridge university press, 2000

2000
[21]

Knowledge distillation: A survey

Jianping Gou, Baosheng Yu, Stephen J Maybank, and Dacheng Tao. Knowledge distillation: A survey. International journal of computer vision, 129(6):1789–1819, 2021

2021
[22]

MiniLLM: Knowledge distillation of large language models

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. InThe Twelfth International Conference on Learning Representations, 2024

2024
[23]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 16

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[25]

Generative adversarial imitation learning.Advances in neural information processing systems, 29, 2016

Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning.Advances in neural information processing systems, 29, 2016

2016
[26]

Reinforcement Learning via Self-Distillation

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[27]

Tinybert: Distilling bert for natural language understanding

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling bert for natural language understanding. InFindings of the association for com- putational linguistics: EMNLP 2020, pages 4163–4174, 2020

2020
[28]

A theory of learning with autoregressive chain of thought

Nirmit Joshi, Gal Vardi, Adam Block, Surbhi Goel, Zhiyuan Li, Theodor Misiakiewicz, and Nathan Srebro. A theory of learning with autoregressive chain of thought. InThe Thirty Eighth Annual Conference on Learning Theory, pages 3161–3212. PMLR, 2025

2025
[29]

Gemma 3 Technical Report

Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

NanoGPT.https://github.com/karpathy/nanoGPT, 2022

Andrej Karpathy. NanoGPT.https://github.com/karpathy/nanoGPT, 2022

2022
[31]

Sequence-level knowledge distillation

Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. InProceedings of the 2016 conference on empirical methods in natural language processing, pages 1317–1327, 2016

2016
[32]

Kingma and Jimmy Ba

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In3rd International Conference on Learning Representations, ICLR 2015, 2015

2015
[33]

DISTILLM: towards streamlined distillation for large language models

Jongwoo Ko, Sungnyun Kim, Tianyi Chen, and Se-Young Yun. DISTILLM: towards streamlined distillation for large language models. InProceedings of the 41st International Conference on Machine Learning, pages 24872–24895, 2024

2024
[34]

Dart: Noise injection for robust imitation learning

Michael Laskey, Jonathan Lee, Roy Fox, Anca Dragan, and Ken Goldberg. Dart: Noise injection for robust imitation learning. InConference on robot learning, pages 143–156. PMLR, 2017

2017
[35]

Datacomp-lm: In search of the next generation of training sets for language models.Advances in Neural Information Processing Systems, 37:14200–14282, 2024

Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, et al. Datacomp-lm: In search of the next generation of training sets for language models.Advances in Neural Information Processing Systems, 37:14200–14282, 2024

2024
[36]

Agnostic interactive imitation learning: New theory and practical algorithms.arXiv preprint arXiv:2312.16860, 2023

Yichen Li and Chicheng Zhang. Agnostic interactive imitation learning: New theory and practical algorithms.arXiv preprint arXiv:2312.16860, 2023

work page arXiv 2023
[37]

Interactive and hybrid imitation learning: Provably beating behavior cloning.arXiv preprint arXiv:2412.07057, 2024

Yichen Li and Chicheng Zhang. Interactive and hybrid imitation learning: Provably beating behavior cloning.arXiv preprint arXiv:2412.07057, 2024

work page arXiv 2024
[38]

Chain of thought empowers transformers to solve inherentlyserialproblems

Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherentlyserialproblems. InThe Twelfth International Conference on Learning Representations, 2024

2024
[39]

Learning quickly when irrelevant attributes abound: A new linear-threshold algo- rithm.Machine learning, 2(4):285–318, 1988

Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algo- rithm.Machine learning, 2(4):285–318, 1988

1988
[40]

TinyGSM: achieving 80% on GSM8k with one billion parameters

Bingbin Liu, Sebastien Bubeck, Ronen Eldan, Janardhan Kulkarni, Yuanzhi Li, Anh Nguyen, Rachel Ward, and Yi Zhang. TinyGSM: achieving 80% on GSM8k with one billion parameters. InThe 3rd Workshop on Mathematical Reasoning and AI at NeurIPS’23, 2023. 17

2023
[41]

On-policy distillation.Thinking Machines Lab: Connectionism,

Kevin Lu and Thinking Machines Lab. On-policy distillation.Thinking Machines Lab: Connectionism,
[42]

https://thinkingmachines.ai/blog/on-policy-distillation

doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation

work page doi:10.64434/tml.20251026
[43]

2 OLMo 2 Furious

Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, et al. 2 OLMo 2 furious.arXiv preprint arXiv:2501.00656, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

Cambridge university press, 2025

Yury Polyanskiy and Yihong Wu.Information theory: From coding to learning. Cambridge university press, 2025

2025
[45]

Alvinn: An autonomous land vehicle in a neural network.Advances in neural information processing systems, 1, 1988

Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network.Advances in neural information processing systems, 1, 1988

1988
[46]

Computational-statistical tradeoffs at the next-token prediction barrier: Autoregressive and imita- tion learning under misspecification

Dhruv Rohatgi, Adam Block, Audrey Huang, Akshay Krishnamurthy, and Dylan J Foster. Computational-statistical tradeoffs at the next-token prediction barrier: Autoregressive and imita- tion learning under misspecification. InThe Thirty Eighth Annual Conference on Learning Theory, pages 4831–4837. PMLR, 2025

2025
[47]

John Wiley & Sons Hoboken, NJ, USA, 2009

Elvezio M Ronchetti and Peter J Huber.Robust statistics. John Wiley & Sons Hoboken, NJ, USA, 2009

2009
[48]

Efficient reductions for imitation learning

Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. InProceedings of the thir- teenth international conference on artificial intelligence and statistics, pages 661–668. JMLR Workshop and Conference Proceedings, 2010

2010
[49]

Reinforcement and Imitation Learning via Interactive No-Regret Learning

Stephane Ross and J Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning.arXiv preprint arXiv:1406.5979, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[50]

A reduction of imitation learning and structured prediction to no-regret online learning

Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. InProceedings of the fourteenth international conference on artificial intelligence and statistics, pages627–635.JMLRWorkshopandConferenceProceedings, 2011

2011
[51]

Policy Distillation

Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirk- patrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. arXiv preprint arXiv:1511.06295, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[52]

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of bert: smaller, faster, cheaper and lighter.arXiv preprint arXiv:1910.01108, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1910
[53]

Contextual bandits and imitation learning with preference-based active queries.Advances in Neural Information Processing Systems, 36: 11261–11295, 2023

Ayush Sekhari, Karthik Sridharan, Wen Sun, and Runzhe Wu. Contextual bandits and imitation learning with preference-based active queries.Advances in Neural Information Processing Systems, 36: 11261–11295, 2023

2023
[54]

Selective sampling and imitation learning via online regression.Advances in Neural Information Processing Systems, 36:67213–67268, 2023

Ayush Sekhari, Karthik Sridharan, Wen Sun, and Runzhe Wu. Selective sampling and imitation learning via online regression.Advances in Neural Information Processing Systems, 36:67213–67268, 2023

2023
[55]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[56]

A Survey of On-Policy Distillation for Large Language Models

Mingyang Song and Mao Zheng. A survey of on-policy distillation for large language models.arXiv preprint arXiv:2604.00626, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[57]

MIT press Cambridge, 1998

Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998. 18

1998
[58]

Causal imitation learning under temporally correlated noise

Gokul Swamy, Sanjiban Choudhury, Drew Bagnell, and Steven Wu. Causal imitation learning under temporally correlated noise. InInternational Conference on Machine Learning, pages 20877–20890. PMLR, 2022

2022
[59] Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xiong-Hui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, and Junyang Lin. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for LLM reasoning. In The Thirty-ninth Annual Con... (2026)
[60] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 33:5776–5788, 2020.
[61] Maurice Weber, Daniel Y. Fu, Quentin Anthony, Yonatan Oren, Shane Adams, Anton Alexandrov, Xiaozhong Lyu, Huu Nguyen, Xiaozhe Yao, Virginia Adams, Ben Athiwaratkun, Rahul Chalamala, Kezhen Chen, Max Ryabinin, Tri Dao, Percy Liang, Christopher Ré, Irina Rish, and Ce Zhang. RedPajama: an open dataset for training large language models. NeurIPS Datasets and B... (2024)
[62] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[63] Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. Self-distilled RLVR. arXiv preprint arXiv:2604.03128, 2026.
[64] Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, and Yizhe Zhang. Embarrassingly simple self-distillation improves code generation. arXiv preprint arXiv:2604.01193, 2026.
[65] Tong Zhang. From ε-entropy to KL-entropy: Analysis of minimum information complexity density estimation. The Annals of Statistics, pages 2180–2210, 2006.
[66] Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026.
[67] Binbin Zheng, Xing Ma, Yiheng Liang, Jingqing Ruan, Xiaoliang Fu, Kepeng Lin, Benchang Zhu, Ke Zeng, and Xunliang Cai. SCOPE: Signal-calibrated on-policy distillation enhancement with dual-path adaptive weighting. arXiv preprint arXiv:2604.10688, 2026.
