When Does Online Imitation Learning Help in LLM Post-Training? The Role of (Non-)Realizability Beyond Horizon

Andrej Risteski; Bingbin Liu; Huaqing Zhang; Jingchu Gai; Juno Kim

arxiv: 2606.30445 · v1 · pith:DLG3BE5Znew · submitted 2026-06-29 · 💻 cs.LG

When Does Online Imitation Learning Help in LLM Post-Training? The Role of (Non-)Realizability Beyond Horizon

Huaqing Zhang , Jingchu Gai , Juno Kim , Bingbin Liu , Andrej Risteski This is my paper

Pith reviewed 2026-06-30 07:38 UTC · model grok-4.3

classification 💻 cs.LG

keywords imitation learningLLM post-trainingrealizabilityonline imitation learningoffline imitation learningmisspecificationinformation-theoretic bounds

0 comments

The pith

Offline imitation learning encounters an information-theoretic bottleneck in non-realizable settings even for H=1, while online IL achieves high performance under reward-relative misspecification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges the view that error accumulation over long horizons explains why online imitation learning outperforms offline supervised fine-tuning in LLM post-training. It instead focuses on realizability: whether the student policy class can represent the expert policy. When realizability holds, offline IL already reaches expert performance. When it does not, offline IL is constrained by an information-theoretic limit that applies even to single-step tasks, yet online IL can still succeed despite large mismatches between expert and student distributions under a specific structural form of misspecification tied to the reward.

Core claim

Under realizability, offline IL matches expert performance. In non-realizable settings, offline IL encounters an information-theoretic bottleneck even when horizon H=1, and online IL provably achieves high performance despite a large distributional mismatch between the expert and student policies under a structural characterization of misspecification relative to the reward.

What carries the argument

Structural characterization of misspecification relative to the reward, which separates cases where online interaction overcomes the offline information limit from those where it does not.

If this is right

Offline IL suffices to match expert performance whenever the student policy class can realize the expert.
The information-theoretic lower bound on offline IL holds independently of horizon length in non-realizable cases.
Online IL recovers high performance in misspecified settings by collecting on-policy data even when expert and student distributions differ substantially.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

For LLM tasks known to have limited policy expressivity, resources may be better spent on online data collection than on enlarging offline datasets.
The result separates the value of online interaction from horizon length, suggesting short-sequence tasks can still benefit from online methods if misspecified.
Practical application requires identifying whether a given fine-tuning task satisfies the reward-relative misspecification structure.

Load-bearing premise

The structural characterization of misspecification relative to the reward accurately models the non-realizability encountered in the LLM post-training tasks and policy classes considered.

What would settle it

A simple H=1 task in which the policy class is misspecified according to the structural characterization yet offline IL still matches expert performance, or in which online IL fails to achieve high performance.

Figures

Figures reproduced from arXiv: 2606.30445 by Andrej Risteski, Bingbin Liu, Huaqing Zhang, Jingchu Gai, Juno Kim.

**Figure 1.** Figure 1: Offline IL suffices with realizable experts. Across different tasks, SFT with a realizable expert fully matches expert performance, and on-policy distillation yields no accuracy gains or faster training. In contrast, when the expert is non-realizable, SFT exhibits a significant performance gap. Instruct [57] as the base model for Countdown, Llama-3.2-3B [19] fine-tuned on OpenR1 [11] for GSM8K, and DeepSee… view at source ↗

**Figure 2.** Figure 2: Bounding performance gap by expert-student discrepancy can be overly pessimistic. We compare DeepSeek-R1-0528 (expert) and DeepSeek-R1-0528-Qwen3-8B (student) on AIME benchmark. (a) The expert and student differ substantially in response-length distributions and keyword frequencies. (b) Length-induced TV distance and keyword-frequency gaps are lower bounds of the TV distance between response distributions,… view at source ↗

**Figure 3.** Figure 3: Synthetic example: misspecification relative to reward. Actions are (x, y) ∈ R 2 and policies are Gaussians; the reward is r(x, y) = 1[y ≥ 0] (shaded region marks r(x, y) = 1). Expert π e = N ((−2, 2), I2). Left: student class ΠA: µx = 0 (red dashed line) yields a reverse-KL projection with V (πˆA) ≈ 0.977. Right: student class ΠB: µy = µx yields V (πˆB) = 0.5. The performance gap arises from different mis… view at source ↗

**Figure 5.** Figure 5: OOD generalization and catastrophic forgetting. Left: Models are evaluated on Countdown instances with larger number range. Right: Catastrophic forgetting during GSM8K training. Models are evaluated on MMLU benchmark. Both on-policy distillation and SFT from a realizable expert achieve strong OOD performance that matches the expert, whereas SFT from a non-realizable expert yields much worse OOD generalizat… view at source ↗

**Figure 6.** Figure 6: The on-policy distillation results in Section 5.3. Starting from the same base model, DeepSeek-R1-Distill-Qwen-1.5B, we perform on-policy distillation using two 7B experts: R1-Distill-7B and Skywork-7B. Distillation from Skywork-7B improves average performance on AIME 2024 and 2025 from 25.0% to 32.8%, whereas distillation from R1-Distill-7B improves it to 27.1% (averaged over three runs). Each sample cons… view at source ↗

read the original abstract

Online imitation learning (IL), particularly on-policy distillation, has emerged as a strong LLM post-training approach, often outperforming offline supervised fine-tuning (SFT). Yet a principled understanding of when and why online interaction helps remains unclear. In this work, we challenge the view that error accumulation is the main source of online IL's advantage, and instead show that the benefits of online interaction depend critically on whether the setting is realizable, i.e., whether the student policy class can represent the expert policy. Under realizability, we empirically find that offline IL already matches expert performance. In contrast, in non-realizable (misspecified) settings, we prove that offline IL encounters an information-theoretic bottleneck even when horizon $H=1$, and propose a structural characterization of misspecification relative to the reward, under which online IL provably achieves high performance despite a large distributional mismatch between the expert and student policies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's core claim is that online IL helps mainly in non-realizable settings via a structural misspecification characterization, not error accumulation, with an H=1 info bottleneck for offline.

read the letter

The main thing to know is that this paper shifts the explanation for why online imitation learning beats offline SFT in LLM post-training. It argues the advantage comes from non-realizability of the expert policy in the student class, not from compounding errors over long horizons. They prove an information-theoretic lower bound for offline IL even at H=1 under their setup, and introduce a structural characterization of misspecification relative to the reward under which online IL still succeeds despite large policy mismatch.

What the paper does well is the clean realizable versus non-realizable split, backed by experiments showing offline already matches expert performance when realizability holds. The H=1 lower bound and the structural lens are new relative to prior work on horizon-dependent error accumulation. The abstract states the proofs and empirical findings directly.

The soft spot is the structural characterization itself. The central claims rest on this being the relevant form of misspecification for LLM policy classes and rewards. If typical post-training misspecification looks different—optimization artifacts, token-level versus sequence-level gaps, or other structures—the lower bound and online advantage may not transfer. The abstract gives no independent check that this characterization covers common cases.

This is for researchers working on LLM post-training pipelines or imitation learning theory who need guidance on when online interaction is worth the extra cost. It deserves a serious referee because the question is practical, the theoretical distinction is sharp, and the empirical contrast is straightforward to check.

Referee Report

2 major / 2 minor

Summary. The paper claims that online imitation learning's advantage over offline SFT in LLM post-training stems from non-realizability of the student policy class (rather than horizon-induced error accumulation). Under realizability, offline IL matches expert performance empirically. In non-realizable settings, offline IL faces an information-theoretic bottleneck even at H=1; under a proposed structural characterization of misspecification relative to the reward, online IL achieves high performance despite large expert-student policy mismatch.

Significance. If the structural characterization of misspecification is the relevant one for LLM policy classes and rewards, the result supplies a principled account of when online interaction helps, shifting emphasis from horizon length to realizability and offering guidance for post-training design.

major comments (2)

[Section introducing the structural characterization (and associated theorems)] The information-theoretic lower bound for offline IL (even at H=1) and the online-IL guarantee both rest on the proposed structural characterization of misspecification w.r.t. the reward; the manuscript does not supply evidence that this form (as opposed to token-level mismatch or optimization-induced misspecification) is the operative one in the LLM tasks considered.
[Empirical evaluation section] The realizability experiments claim offline IL already matches expert performance, but the policy class, reward structure, and data-exclusion rules used to enforce realizability are not stated with sufficient precision to confirm that the empirical regime matches the theoretical assumptions.

minor comments (2)

[Notation and preliminaries] Notation for the structural misspecification parameters should be introduced once and used consistently across theorems and experiments.
[Abstract] The abstract paragraph on non-realizable settings could explicitly name the reward-relative characterization so readers immediately see the scope of the claimed online-IL advantage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope of our contributions on realizability in online vs. offline IL for LLM post-training. We address each major comment below with targeted revisions where appropriate.

read point-by-point responses

Referee: [Section introducing the structural characterization (and associated theorems)] The information-theoretic lower bound for offline IL (even at H=1) and the online-IL guarantee both rest on the proposed structural characterization of misspecification w.r.t. the reward; the manuscript does not supply evidence that this form (as opposed to token-level mismatch or optimization-induced misspecification) is the operative one in the LLM tasks considered.

Authors: We acknowledge that the manuscript introduces the reward-relative structural characterization primarily as a sufficient condition enabling the information-theoretic lower bound for offline IL (even at H=1) and the corresponding online IL guarantee, without direct empirical evidence that this specific form dominates over token-level or optimization-induced misspecification in the LLM tasks studied. The characterization is derived directly from the reward structure in post-training, where misspecification is defined relative to the expert's optimal policy under the reward rather than per-token. In the revision we will add a dedicated discussion subsection with concrete examples from common LLM reward models (e.g., how preference-based rewards induce reward-relative rather than token-level mismatch) and explicitly note that we do not claim exclusivity but that the condition supplies a principled account when it applies. This addresses the concern without altering the theorems. revision: partial
Referee: [Empirical evaluation section] The realizability experiments claim offline IL already matches expert performance, but the policy class, reward structure, and data-exclusion rules used to enforce realizability are not stated with sufficient precision to confirm that the empirical regime matches the theoretical assumptions.

Authors: We agree that the empirical section requires greater precision to confirm alignment with the theoretical realizability assumptions. In the revised manuscript we will explicitly state: (i) the policy class parameterization (model architecture, size, and fine-tuning procedure), (ii) the reward structure or simulation method defining the expert policy, and (iii) the exact data-exclusion rules used to enforce that the expert lies within the student class. These additions will allow direct verification that the experiments operate under the realizable regime assumed in the theory. revision: yes

Circularity Check

0 steps flagged

No circularity: results rest on stated proofs and proposed characterization

full rationale

The paper derives its claims via an empirical observation that offline IL matches expert performance under realizability, an information-theoretic lower bound proof for offline IL even at H=1 in non-realizable settings, and a proposed structural characterization of misspecification relative to the reward under which online IL is shown to succeed. These steps are presented as independent theoretical derivations and do not reduce by construction to fitted inputs, self-definitions, or self-citation chains. The analysis is self-contained against its own proofs and observations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard imitation learning assumptions such as bounded rewards and Markovian dynamics; no free parameters or invented entities are introduced in the abstract.

axioms (1)

standard math Standard assumptions in imitation learning theory such as bounded rewards and Markov decision process properties.
Invoked to establish the information-theoretic bottleneck and performance guarantees.

pith-pipeline@v0.9.1-grok · 5704 in / 1229 out tokens · 39203 ms · 2026-06-30T07:38:40.185813+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

71 extracted references · 32 canonical work pages · 21 internal anchors

[1]

and Ng, A

Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. InProceedings of the twenty-first international conference on Machine learning, pp. 1, 2004

2004
[2]

R., Geist, M., and Bachem, O

Agarwal, R., Vieillard, N., Zhou, Y., Stanczyk, P., Garea, S. R., Geist, M., and Bachem, O. On-policy distillation of language models: Learning from self-generated mistakes. InThe Twelfth International Conference on Learning Representations, 2024

2024
[3]

D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

1901
[4]

Mitigating covariate shift in imitation learning via offline data with partial coverage.Advances in Neural Information Processing Systems, 34: 965–979, 2021

Chang, J., Uehara, M., Sreenivas, D., Kidambi, R., and Sun, W. Mitigating covariate shift in imitation learning via offline data with partial coverage.Advances in Neural Information Processing Systems, 34: 965–979, 2021

2021
[5]

Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting

Chen, H., Razin, N., Narasimhan, K., and Chen, D. Retaining by doing: The role of on-policy data in mitigating forgetting.arXiv preprint arXiv:2510.18874, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

and Jiang, N

Chen, J. and Jiang, N. Information-theoretic considerations in batch reinforcement learning. In International conference on machine learning, pp. 1042–1051. PMLR, 2019

2019
[7]

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Chu, T., Zhai, Y., Yang, J., Tong, S., Xie, S., Schuurmans, D., Le, Q. V., Levine, S., and Ma, Y. Sft memorizes, rl generalizes: A comparative study of foundation model post-training.arXiv preprint arXiv:2501.17161, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems.arXiv preprint arXiv: 2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[9]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URLhttps://arxiv.org/abs/2501.12948

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Efficient imitation under misspecification

Espinosa-Dice, N., Choudhury, S., Sun, W., and Swamy, G. Efficient imitation under misspecification. arXiv preprint arXiv:2503.13162, 2025

work page arXiv 2025
[11]

Open r1: A fully open reproduction of deepseek-r1, 2025

Face, H. Open r1: A fully open reproduction of deepseek-r1, 2025

2025
[12]

and Rakhlin, A

Foster, D. and Rakhlin, A. Beyond ucb: Optimal and efficient contextual bandits with regression oracles. InInternational conference on machine learning, pp. 3199–3210. PMLR, 2020. 13

2020
[13]

Practical contextual bandits with regression oracles

Foster, D., Agarwal, A., Dudík, M., Luo, H., and Schapire, R. Practical contextual bandits with regression oracles. InInternational Conference on Machine Learning, pp. 1539–1548. PMLR, 2018

2018
[14]

Foster, D. J. and Rakhlin, A. Foundations of reinforcement learning and interactive decision making. arXiv preprint arXiv:2312.16730, 2023

work page arXiv 2023
[15]

J., Block, A., and Misra, D

Foster, D. J., Block, A., and Misra, D. Is behavior cloning all you need? understanding horizon in imitation learning.Advances in Neural Information Processing Systems, 37:120602–120666, 2024

2024
[16]

J., Mhammedi, Z., and Rohatgi, D

Foster, D. J., Mhammedi, Z., and Rohatgi, D. Is a good foundation necessary for efficient reinforcement learning? the computational role of the base model in exploration.arXiv preprint arXiv:2503.07453, 2025

work page arXiv 2025
[17]

Importance-weighted offline learning done right

Gabbianelli, G., Neu, G., and Papini, M. Importance-weighted offline learning done right. InInternational Conference on Algorithmic Learning Theory, pp. 614–634. PMLR, 2024

2024
[18]

Gandhi, K., Chakravarthy, A., Singh, A., Lile, N., and Goodman, N. D. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars.arXiv preprint arXiv:2503.01307, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

The Llama 3 Herd of Models

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Minillm: Knowledge distillation of large language models

Gu, Y., Dong, L., Wei, F., and Huang, M. Minillm: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024

2024
[21]

The false promise of imitating proprietary language models

Gudibande, A., Wallace, E., Snell, C., Geng, X., Liu, H., Abbeel, P., Levine, S., and Song, D. The false promise of imitating proprietary language models. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=Kz3yckpCN5

2024
[22]

Skywork Open Reasoner 1 Technical Report

He, J., Liu, J., Liu, C. Y., Yan, R., Wang, C., Cheng, P., Zhang, X., Zhang, F., Xu, J., Shen, W., et al. Skywork open reasoner 1 technical report.arXiv preprint arXiv:2505.22312, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021

2021
[24]

Measuring Mathematical Problem Solving With the MATH Dataset

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[25]

J., Rohatgi, D., Zhang, C., Simchowitz, M., Ash, J

Huang, A., Block, A., Foster, D. J., Rohatgi, D., Zhang, C., Simchowitz, M., Ash, J. T., and Krish- namurthy, A. Self-improvement in language models: The sharpening mechanism. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[26]

D., Sun, W., Krishnamurthy, A., and Foster, D

Huang, A., Zhan, W., Xie, T., Lee, J. D., Sun, W., Krishnamurthy, A., and Foster, D. J. Correcting the mythos of kl-regularization: Direct alignment without overoptimization via chi-squared preference optimization. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[27]

Teach small models to reason by curriculum distillation

Jiang, W., Lu, Y., Lin, H., Han, X., and Sun, L. Teach small models to reason by curriculum distillation. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V. (eds.),Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 7412–7422, Suzhou, China, November 2025. Association for Computational Linguistics. I...

work page doi:10.18653/v1/2025.emnlp-main.376 2025
[28]

D., and Jun, K.-S

Kim, J., Yun, J., Lee, J. D., and Jun, K.-S. Coverage improvement and fast convergence of on-policy preference learning.arXiv preprint arXiv:2601.08421, 2026

work page arXiv 2026
[29]

and Rush, A

Kim, Y. and Rush, A. M. Sequence-level knowledge distillation. InProceedings of the 2016 conference on empirical methods in natural language processing, pp. 1317–1327, 2016. 14

2016
[30]

M., Ma, T., and Liang, P

Kumar, A., Raghunathan, A., Jones, R. M., Ma, T., and Liang, P. Fine-tuning can distort pretrained features and underperform out-of-distribution. InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URLhttps: //openreview.net/forum?id=UYneFzXSJWh

2022
[31]

H., Gonzalez, J

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

2023
[32]

Y., Ramasubramanian, B., and Poovendran, R

Li, Y., Yue, X., Xu, Z., Jiang, F., Niu, L., Lin, B. Y., Ramasubramanian, B., and Poovendran, R. Small models struggle to learn from strong reasoners.arXiv preprint arXiv:2502.12143, 2025

work page arXiv 2025
[33]

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

Li, Y., Zuo, Y., He, B., Zhang, J., Xiao, C., Qian, C., Yu, T., Gao, H.-a., Yang, W., Liu, Z., et al. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe. arXiv preprint arXiv:2604.13016, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[34]

DeepSeek-V3 Technical Report

Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

and Lab, T

Lu, K. and Lab, T. M. On-policy distillation.Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation

work page doi:10.64434/tml.20251026 2025
[36]

Y., Roongta, M., Cai, C., Luo, J., Zhang, T., Li, L

Luo, M., Tan, S., Wong, J., Shi, X., Tang, W. Y., Roongta, M., Cai, C., Luo, J., Zhang, T., Li, L. E., et al. Deepscaler: Surpassing o1-preview with a 1.5 b model by scaling rl.Notion Blog, 2025

2025
[37]

Error bounds for approximate policy iteration

Munos, R. Error bounds for approximate policy iteration. InProceedings of the Twentieth International Conference on International Conference on Machine Learning, pp. 560–567, 2003

2003
[38]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

2022
[39]

Tinyzero

Pan, J., Zhang, J., Wang, X., Yuan, L., Peng, H., and Suhr, A. Tinyzero. https://github.com/Jiayi- Pan/TinyZero, 2025. Accessed: 2025-01-24

2025
[40]

W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763. PmLR, 2021

2021
[41]

D., Ermon, S., and Finn, C

Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

2023
[42]

Toward the fundamental limits of imitation learning.Advances in Neural Information Processing Systems, 33:2914–2924, 2020

Rajaraman, N., Yang, L., Jiao, J., and Ramchandran, K. Toward the fundamental limits of imitation learning.Advances in Neural Information Processing Systems, 33:2914–2924, 2020

2020
[43]

On the value of interaction and function approximation in imitation learning

Rajaraman, N., Han, Y., Yang, L., Liu, J., Jiao, J., and Ramchandran, K. On the value of interaction and function approximation in imitation learning. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.),Advances in Neural Information Processing Systems, volume 34, pp. 1325–1336. Curran Associates, Inc., 2021. URLhttps://proc...

2021
[44]

Bridging offline reinforcement learning and imitation learning: A tale of pessimism.Advances in Neural Information Processing Systems, 34: 11702–11716, 2021

Rashidinejad, P., Zhu, B., Ma, C., Jiao, J., and Russell, S. Bridging offline reinforcement learning and imitation learning: A tale of pessimism.Advances in Neural Information Processing Systems, 34: 11702–11716, 2021

2021
[45]

Rohatgi, D., Block, A., Huang, A., Krishnamurthy, A., and Foster, D. J. Computational-statistical trade- offs at the next-token prediction barrier: Autoregressive and imitation learning under misspecification. arXiv preprint arXiv:2502.12465, 2025. 15

work page arXiv 2025
[46]

and Bagnell, D

Ross, S. and Bagnell, D. Efficient reductions for imitation learning. InProceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 661–668. JMLR Workshop and Conference Proceedings, 2010

2010
[47]

Reinforcement and Imitation Learning via Interactive No-Regret Learning

Ross, S. and Bagnell, J. A. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[48]

A reduction of imitation learning and structured prediction to no-regret online learning

Ross, S., Gordon, G., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. InProceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635. JMLR Workshop and Conference Proceedings, 2011

2011
[49]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y. K., Wu, Y., and Guo, D. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv: 2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[50]

RL's Razor: Why Online Reinforcement Learning Forgets Less

Shenfeld, I., Pari, J., and Agrawal, P. Rl’s razor: Why online reinforcement learning forgets less.arXiv preprint arXiv:2509.04259, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[51]

HybridFlow: A Flexible and Efficient RLHF Framework

Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., and Wu, C. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[52]

and Xu, Y

Simchi-Levi, D. and Xu, Y. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability.Mathematics of Operations Research, 47(3):1904–1931, 2022

1904
[53]

Song, Y., Rohatgi, D., Singh, A., and Bagnell, J. A. To distill or decide? understanding the algorithmic trade-off in partially observable reinforcement learning.arXiv preprint arXiv: 2510.03207, 2025

work page arXiv 2025
[54]

and Joachims, T

Swaminathan, A. and Joachims, T. The self-normalized estimator for counterfactual learning.advances in neural information processing systems, 28, 2015

2015
[55]

A., Wu, S., Jiao, J., and Ramchandran, K

Swamy, G., Rajaraman, N., Peng, M., Choudhury, S., Bagnell, J. A., Wu, S., Jiao, J., and Ramchandran, K. Minimax optimal online imitation learning via replay estimation. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.),Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing ...

2022
[56]

and Schapire, R

Syed, U. and Schapire, R. E. A game-theoretic approach to apprenticeship learning.Advances in neural information processing systems, 20, 2007

2007
[57]

Team, Q. Qwen2. 5: A party of foundation models, september 2024.URL https://qwenlm. github. io/blog/qwen2, 5(4), 2024

2024
[58]

TRL: Transformers Reinforcement Learning, 2020

von Werra, L., Belkada, Y., Tunstall, L., Beeching, E., Thrush, T., Lambert, N., Huang, S., Rasul, K., and Gallouédec, Q. TRL: Transformers Reinforcement Learning, 2020. URLhttps://github.com/ huggingface/trl

2020
[59]

Oracle-efficient pessimism: Offline policy optimization in contextual bandits

Wang, L., Krishnamurthy, A., and Slivkins, A. Oracle-efficient pessimism: Offline policy optimization in contextual bandits. InInternational Conference on Artificial Intelligence and Statistics, pp. 766–774. PMLR, 2024

2024
[60]

MiMo-V2-Flash Technical Report

Xiao, B., Xia, B., Yang, B., Gao, B., Shen, B., Zhang, C., He, C., Lou, C., Luo, F., Wang, G., et al. Mimo-v2-flash technical report.arXiv preprint arXiv:2601.02780, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[61]

J., Krishnamurthy, A., Rosset, C., Awadallah, A

Xie, T., Foster, D. J., Krishnamurthy, A., Rosset, C., Awadallah, A. H., and Rakhlin, A. Exploratory preference optimization: Harnessing implicit q*-approximation for sample-efficient rlhf. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[62]

Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl-constraint.arXiv preprint arXiv:2312.11456, 2023

Xiong, W., Dong, H., Ye, C., Wang, Z., Zhong, H., Ji, H., Jiang, N., and Zhang, T. Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl-constraint.arXiv preprint arXiv:2312.11456, 2023. 16

work page arXiv 2023
[63]

Qwen3 Technical Report

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[64]

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

Yang, W., Liu, W., Xie, R., Yang, K., Yang, S., and Lin, Y. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation.arXiv preprint arXiv:2602.12125, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[65]

Black-box on-policy distillation of large language models.CoRR, abs/2511.10643, 2025

Ye, T., Dong, L., Chi, Z., Wu, X., Huang, S., and Wei, F. Black-box on-policy distillation of large language models.CoRR, abs/2511.10643, 2025. doi: 10.48550/ARXIV.2511.10643. URL https: //doi.org/10.48550/arXiv.2511.10643

work page doi:10.48550/arxiv.2511.10643 2025
[66]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[67]

SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

Zeng, W., Huang, Y., Liu, Q., Liu, W., He, K., Ma, Z., and He, J. Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild.arXiv preprint arXiv:2503.18892, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[68]

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Zhao, S., Xie, Z., Liu, M., Huang, J., Pang, G., Chen, F., and Grover, A. Self-distilled reasoner: On-policy self-distillationforlargelanguagemodels.CoRR,abs/2601.18734, 2026. doi: 10.48550/ARXIV.2601.18734. URLhttps://doi.org/10.48550/arXiv.2601.18734

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.18734 2026
[69]

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

Zheng, Y., Zhang, R., Zhang, J., Ye, Y., Luo, Z., Feng, Z., and Ma, Y. Llamafactory: Unified efficient fine-tuning of 100+ language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024. Association for Computational Linguistics. URLhttp://arxiv.org/abs/...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[70]

ζ 2 l−1X i=1 1[i̸∈ X(D)] # ≥ 1 2 E

Ziebart, B. D., Maas, A. L., Bagnell, J. A., Dey, A. K., et al. Maximum entropy inverse reinforcement learning. InAaai, volume 8, pp. 1433–1438. Chicago, IL, USA, 2008. 17 A Proofs and Additional Results A.1 Proof of Theorem 1 Without loss of generality, assume thatK := |Π| = 2l for some l∈N ∗. If not, letK ′ = 2⌊log2 K⌋ ≤K . We construct the hard instanc...

2008
[71]

Warmup Supervised Fine-Tuning.To enable RL training, we first perform supervised fine-tuning on the OpenR1 dataset [11] to strengthen the model’s reasoning capabilities

as the base model. Warmup Supervised Fine-Tuning.To enable RL training, we first perform supervised fine-tuning on the OpenR1 dataset [11] to strengthen the model’s reasoning capabilities. We use the LlamaFactory [69] framework and train for3,064steps with batch size64and learning rate10 −5 using the AdamW optimizer. RL Training Details.We then perform RL...

2025

[1] [1]

and Ng, A

Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. InProceedings of the twenty-first international conference on Machine learning, pp. 1, 2004

2004

[2] [2]

R., Geist, M., and Bachem, O

Agarwal, R., Vieillard, N., Zhou, Y., Stanczyk, P., Garea, S. R., Geist, M., and Bachem, O. On-policy distillation of language models: Learning from self-generated mistakes. InThe Twelfth International Conference on Learning Representations, 2024

2024

[3] [3]

D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

1901

[4] [4]

Mitigating covariate shift in imitation learning via offline data with partial coverage.Advances in Neural Information Processing Systems, 34: 965–979, 2021

Chang, J., Uehara, M., Sreenivas, D., Kidambi, R., and Sun, W. Mitigating covariate shift in imitation learning via offline data with partial coverage.Advances in Neural Information Processing Systems, 34: 965–979, 2021

2021

[5] [5]

Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting

Chen, H., Razin, N., Narasimhan, K., and Chen, D. Retaining by doing: The role of on-policy data in mitigating forgetting.arXiv preprint arXiv:2510.18874, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

and Jiang, N

Chen, J. and Jiang, N. Information-theoretic considerations in batch reinforcement learning. In International conference on machine learning, pp. 1042–1051. PMLR, 2019

2019

[7] [7]

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Chu, T., Zhai, Y., Yang, J., Tong, S., Xie, S., Schuurmans, D., Le, Q. V., Levine, S., and Ma, Y. Sft memorizes, rl generalizes: A comparative study of foundation model post-training.arXiv preprint arXiv:2501.17161, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems.arXiv preprint arXiv: 2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[9] [9]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URLhttps://arxiv.org/abs/2501.12948

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [10]

Efficient imitation under misspecification

Espinosa-Dice, N., Choudhury, S., Sun, W., and Swamy, G. Efficient imitation under misspecification. arXiv preprint arXiv:2503.13162, 2025

work page arXiv 2025

[11] [11]

Open r1: A fully open reproduction of deepseek-r1, 2025

Face, H. Open r1: A fully open reproduction of deepseek-r1, 2025

2025

[12] [12]

and Rakhlin, A

Foster, D. and Rakhlin, A. Beyond ucb: Optimal and efficient contextual bandits with regression oracles. InInternational conference on machine learning, pp. 3199–3210. PMLR, 2020. 13

2020

[13] [13]

Practical contextual bandits with regression oracles

Foster, D., Agarwal, A., Dudík, M., Luo, H., and Schapire, R. Practical contextual bandits with regression oracles. InInternational Conference on Machine Learning, pp. 1539–1548. PMLR, 2018

2018

[14] [14]

Foster, D. J. and Rakhlin, A. Foundations of reinforcement learning and interactive decision making. arXiv preprint arXiv:2312.16730, 2023

work page arXiv 2023

[15] [15]

J., Block, A., and Misra, D

Foster, D. J., Block, A., and Misra, D. Is behavior cloning all you need? understanding horizon in imitation learning.Advances in Neural Information Processing Systems, 37:120602–120666, 2024

2024

[16] [16]

J., Mhammedi, Z., and Rohatgi, D

Foster, D. J., Mhammedi, Z., and Rohatgi, D. Is a good foundation necessary for efficient reinforcement learning? the computational role of the base model in exploration.arXiv preprint arXiv:2503.07453, 2025

work page arXiv 2025

[17] [17]

Importance-weighted offline learning done right

Gabbianelli, G., Neu, G., and Papini, M. Importance-weighted offline learning done right. InInternational Conference on Algorithmic Learning Theory, pp. 614–634. PMLR, 2024

2024

[18] [18]

Gandhi, K., Chakravarthy, A., Singh, A., Lile, N., and Goodman, N. D. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars.arXiv preprint arXiv:2503.01307, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [19]

The Llama 3 Herd of Models

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[20] [20]

Minillm: Knowledge distillation of large language models

Gu, Y., Dong, L., Wei, F., and Huang, M. Minillm: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024

2024

[21] [21]

The false promise of imitating proprietary language models

Gudibande, A., Wallace, E., Snell, C., Geng, X., Liu, H., Abbeel, P., Levine, S., and Song, D. The false promise of imitating proprietary language models. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=Kz3yckpCN5

2024

[22] [22]

Skywork Open Reasoner 1 Technical Report

He, J., Liu, J., Liu, C. Y., Yan, R., Wang, C., Cheng, P., Zhang, X., Zhang, F., Xu, J., Shen, W., et al. Skywork open reasoner 1 technical report.arXiv preprint arXiv:2505.22312, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [23]

Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021

2021

[24] [24]

Measuring Mathematical Problem Solving With the MATH Dataset

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[25] [25]

J., Rohatgi, D., Zhang, C., Simchowitz, M., Ash, J

Huang, A., Block, A., Foster, D. J., Rohatgi, D., Zhang, C., Simchowitz, M., Ash, J. T., and Krish- namurthy, A. Self-improvement in language models: The sharpening mechanism. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[26] [26]

D., Sun, W., Krishnamurthy, A., and Foster, D

Huang, A., Zhan, W., Xie, T., Lee, J. D., Sun, W., Krishnamurthy, A., and Foster, D. J. Correcting the mythos of kl-regularization: Direct alignment without overoptimization via chi-squared preference optimization. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[27] [27]

Teach small models to reason by curriculum distillation

Jiang, W., Lu, Y., Lin, H., Han, X., and Sun, L. Teach small models to reason by curriculum distillation. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V. (eds.),Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 7412–7422, Suzhou, China, November 2025. Association for Computational Linguistics. I...

work page doi:10.18653/v1/2025.emnlp-main.376 2025

[28] [28]

D., and Jun, K.-S

Kim, J., Yun, J., Lee, J. D., and Jun, K.-S. Coverage improvement and fast convergence of on-policy preference learning.arXiv preprint arXiv:2601.08421, 2026

work page arXiv 2026

[29] [29]

and Rush, A

Kim, Y. and Rush, A. M. Sequence-level knowledge distillation. InProceedings of the 2016 conference on empirical methods in natural language processing, pp. 1317–1327, 2016. 14

2016

[30] [30]

M., Ma, T., and Liang, P

Kumar, A., Raghunathan, A., Jones, R. M., Ma, T., and Liang, P. Fine-tuning can distort pretrained features and underperform out-of-distribution. InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URLhttps: //openreview.net/forum?id=UYneFzXSJWh

2022

[31] [31]

H., Gonzalez, J

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

2023

[32] [32]

Y., Ramasubramanian, B., and Poovendran, R

Li, Y., Yue, X., Xu, Z., Jiang, F., Niu, L., Lin, B. Y., Ramasubramanian, B., and Poovendran, R. Small models struggle to learn from strong reasoners.arXiv preprint arXiv:2502.12143, 2025

work page arXiv 2025

[33] [33]

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

Li, Y., Zuo, Y., He, B., Zhang, J., Xiao, C., Qian, C., Yu, T., Gao, H.-a., Yang, W., Liu, Z., et al. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe. arXiv preprint arXiv:2604.13016, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[34] [34]

DeepSeek-V3 Technical Report

Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[35] [35]

and Lab, T

Lu, K. and Lab, T. M. On-policy distillation.Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation

work page doi:10.64434/tml.20251026 2025

[36] [36]

Y., Roongta, M., Cai, C., Luo, J., Zhang, T., Li, L

Luo, M., Tan, S., Wong, J., Shi, X., Tang, W. Y., Roongta, M., Cai, C., Luo, J., Zhang, T., Li, L. E., et al. Deepscaler: Surpassing o1-preview with a 1.5 b model by scaling rl.Notion Blog, 2025

2025

[37] [37]

Error bounds for approximate policy iteration

Munos, R. Error bounds for approximate policy iteration. InProceedings of the Twentieth International Conference on International Conference on Machine Learning, pp. 560–567, 2003

2003

[38] [38]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

2022

[39] [39]

Tinyzero

Pan, J., Zhang, J., Wang, X., Yuan, L., Peng, H., and Suhr, A. Tinyzero. https://github.com/Jiayi- Pan/TinyZero, 2025. Accessed: 2025-01-24

2025

[40] [40]

W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763. PmLR, 2021

2021

[41] [41]

D., Ermon, S., and Finn, C

Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

2023

[42] [42]

Toward the fundamental limits of imitation learning.Advances in Neural Information Processing Systems, 33:2914–2924, 2020

Rajaraman, N., Yang, L., Jiao, J., and Ramchandran, K. Toward the fundamental limits of imitation learning.Advances in Neural Information Processing Systems, 33:2914–2924, 2020

2020

[43] [43]

On the value of interaction and function approximation in imitation learning

Rajaraman, N., Han, Y., Yang, L., Liu, J., Jiao, J., and Ramchandran, K. On the value of interaction and function approximation in imitation learning. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.),Advances in Neural Information Processing Systems, volume 34, pp. 1325–1336. Curran Associates, Inc., 2021. URLhttps://proc...

2021

[44] [44]

Bridging offline reinforcement learning and imitation learning: A tale of pessimism.Advances in Neural Information Processing Systems, 34: 11702–11716, 2021

Rashidinejad, P., Zhu, B., Ma, C., Jiao, J., and Russell, S. Bridging offline reinforcement learning and imitation learning: A tale of pessimism.Advances in Neural Information Processing Systems, 34: 11702–11716, 2021

2021

[45] [45]

Rohatgi, D., Block, A., Huang, A., Krishnamurthy, A., and Foster, D. J. Computational-statistical trade- offs at the next-token prediction barrier: Autoregressive and imitation learning under misspecification. arXiv preprint arXiv:2502.12465, 2025. 15

work page arXiv 2025

[46] [46]

and Bagnell, D

Ross, S. and Bagnell, D. Efficient reductions for imitation learning. InProceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 661–668. JMLR Workshop and Conference Proceedings, 2010

2010

[47] [47]

Reinforcement and Imitation Learning via Interactive No-Regret Learning

Ross, S. and Bagnell, J. A. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[48] [48]

A reduction of imitation learning and structured prediction to no-regret online learning

Ross, S., Gordon, G., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. InProceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635. JMLR Workshop and Conference Proceedings, 2011

2011

[49] [49]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y. K., Wu, Y., and Guo, D. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv: 2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[50] [50]

RL's Razor: Why Online Reinforcement Learning Forgets Less

Shenfeld, I., Pari, J., and Agrawal, P. Rl’s razor: Why online reinforcement learning forgets less.arXiv preprint arXiv:2509.04259, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[51] [51]

HybridFlow: A Flexible and Efficient RLHF Framework

Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., and Wu, C. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[52] [52]

and Xu, Y

Simchi-Levi, D. and Xu, Y. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability.Mathematics of Operations Research, 47(3):1904–1931, 2022

1904

[53] [53]

Song, Y., Rohatgi, D., Singh, A., and Bagnell, J. A. To distill or decide? understanding the algorithmic trade-off in partially observable reinforcement learning.arXiv preprint arXiv: 2510.03207, 2025

work page arXiv 2025

[54] [54]

and Joachims, T

Swaminathan, A. and Joachims, T. The self-normalized estimator for counterfactual learning.advances in neural information processing systems, 28, 2015

2015

[55] [55]

A., Wu, S., Jiao, J., and Ramchandran, K

Swamy, G., Rajaraman, N., Peng, M., Choudhury, S., Bagnell, J. A., Wu, S., Jiao, J., and Ramchandran, K. Minimax optimal online imitation learning via replay estimation. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.),Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing ...

2022

[56] [56]

and Schapire, R

Syed, U. and Schapire, R. E. A game-theoretic approach to apprenticeship learning.Advances in neural information processing systems, 20, 2007

2007

[57] [57]

Team, Q. Qwen2. 5: A party of foundation models, september 2024.URL https://qwenlm. github. io/blog/qwen2, 5(4), 2024

2024

[58] [58]

TRL: Transformers Reinforcement Learning, 2020

von Werra, L., Belkada, Y., Tunstall, L., Beeching, E., Thrush, T., Lambert, N., Huang, S., Rasul, K., and Gallouédec, Q. TRL: Transformers Reinforcement Learning, 2020. URLhttps://github.com/ huggingface/trl

2020

[59] [59]

Oracle-efficient pessimism: Offline policy optimization in contextual bandits

Wang, L., Krishnamurthy, A., and Slivkins, A. Oracle-efficient pessimism: Offline policy optimization in contextual bandits. InInternational Conference on Artificial Intelligence and Statistics, pp. 766–774. PMLR, 2024

2024

[60] [60]

MiMo-V2-Flash Technical Report

Xiao, B., Xia, B., Yang, B., Gao, B., Shen, B., Zhang, C., He, C., Lou, C., Luo, F., Wang, G., et al. Mimo-v2-flash technical report.arXiv preprint arXiv:2601.02780, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[61] [61]

J., Krishnamurthy, A., Rosset, C., Awadallah, A

Xie, T., Foster, D. J., Krishnamurthy, A., Rosset, C., Awadallah, A. H., and Rakhlin, A. Exploratory preference optimization: Harnessing implicit q*-approximation for sample-efficient rlhf. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[62] [62]

Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl-constraint.arXiv preprint arXiv:2312.11456, 2023

Xiong, W., Dong, H., Ye, C., Wang, Z., Zhong, H., Ji, H., Jiang, N., and Zhang, T. Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl-constraint.arXiv preprint arXiv:2312.11456, 2023. 16

work page arXiv 2023

[63] [63]

Qwen3 Technical Report

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[64] [64]

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

Yang, W., Liu, W., Xie, R., Yang, K., Yang, S., and Lin, Y. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation.arXiv preprint arXiv:2602.12125, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[65] [65]

Black-box on-policy distillation of large language models.CoRR, abs/2511.10643, 2025

Ye, T., Dong, L., Chi, Z., Wu, X., Huang, S., and Wei, F. Black-box on-policy distillation of large language models.CoRR, abs/2511.10643, 2025. doi: 10.48550/ARXIV.2511.10643. URL https: //doi.org/10.48550/arXiv.2511.10643

work page doi:10.48550/arxiv.2511.10643 2025

[66] [66]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[67] [67]

SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

Zeng, W., Huang, Y., Liu, Q., Liu, W., He, K., Ma, Z., and He, J. Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild.arXiv preprint arXiv:2503.18892, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[68] [68]

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Zhao, S., Xie, Z., Liu, M., Huang, J., Pang, G., Chen, F., and Grover, A. Self-distilled reasoner: On-policy self-distillationforlargelanguagemodels.CoRR,abs/2601.18734, 2026. doi: 10.48550/ARXIV.2601.18734. URLhttps://doi.org/10.48550/arXiv.2601.18734

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.18734 2026

[69] [69]

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

Zheng, Y., Zhang, R., Zhang, J., Ye, Y., Luo, Z., Feng, Z., and Ma, Y. Llamafactory: Unified efficient fine-tuning of 100+ language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024. Association for Computational Linguistics. URLhttp://arxiv.org/abs/...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[70] [70]

ζ 2 l−1X i=1 1[i̸∈ X(D)] # ≥ 1 2 E

Ziebart, B. D., Maas, A. L., Bagnell, J. A., Dey, A. K., et al. Maximum entropy inverse reinforcement learning. InAaai, volume 8, pp. 1433–1438. Chicago, IL, USA, 2008. 17 A Proofs and Additional Results A.1 Proof of Theorem 1 Without loss of generality, assume thatK := |Π| = 2l for some l∈N ∗. If not, letK ′ = 2⌊log2 K⌋ ≤K . We construct the hard instanc...

2008

[71] [71]

Warmup Supervised Fine-Tuning.To enable RL training, we first perform supervised fine-tuning on the OpenR1 dataset [11] to strengthen the model’s reasoning capabilities

as the base model. Warmup Supervised Fine-Tuning.To enable RL training, we first perform supervised fine-tuning on the OpenR1 dataset [11] to strengthen the model’s reasoning capabilities. We use the LlamaFactory [69] framework and train for3,064steps with batch size64and learning rate10 −5 using the AdamW optimizer. RL Training Details.We then perform RL...

2025