$\boldsymbol{f}$-OPD: Stabilizing Long-Horizon On-Policy Distillation with Freshness-Aware Control

Jibin Wu; Shimin Zhang; Xianwei Chen

arxiv: 2605.17862 · v1 · pith:Q3QDWVBTnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI

Xianwei Chen, Shimin Zhang, Jibin Wu

Pith reviewed 2026-05-20 12:23 UTC · model grok-4.3

keywords: on-policy distillation · asynchronous training · freshness score · policy drift · LLM agents · long-horizon tasks · reasoning · tool use

The pith

A freshness score lets asynchronous on-policy distillation match synchronous performance on long-horizon agent tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper confronts the efficiency-performance tension in scaling on-policy distillation for large language models, where asynchronous execution is required for throughput but creates objective deviations through stale samples. It decomposes these deviations into rollout drift from the student and supervision drift from the teacher, then defines a sample-level freshness score to measure each buffered sample's reliability relative to the ideal on-policy objective. The f-OPD framework uses this score to adaptively limit the influence of stale samples and thereby bound accumulated policy drift. A sympathetic reader would care because the result shows that long-horizon reasoning, tool-use, and coding-agent tasks can be trained at scale without forcing a choice between speed and correctness.

Core claim

f-OPD stabilizes long-horizon on-policy distillation by theoretically decomposing the objective discrepancy into rollout drift and supervision drift, introducing a sample-level freshness score that quantifies the reliability of buffered samples with respect to the on-policy objective, and adaptively regulating stale-sample influence to constrain accumulated policy drift under asynchronous execution, thereby achieving task performance comparable to synchronous optimization while largely retaining the throughput advantages of asynchronous execution.

What carries the argument

The sample-level freshness score, which quantifies how much a buffered sample deviates from the ideal on-policy objective and guides adaptive regulation of its influence within the f-OPD framework.
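
As a rough illustration of how such a score could regulate sample influence, the sketch below follows the form quoted in the Lean-theorems section of this page: a score $f_i = \frac{1}{\tau_i + 1}\exp(-\tilde{\Delta}_i)$ fed through a gate $\sigma(f_i - \xi)$ that scales each sample's loss plus a $\lambda$-weighted anchor term. Reading $\tau_i$ as the policy-update lag and $\tilde{\Delta}_i$ as a non-negative drift estimate is an assumption made here, as are all function names, the default values of `xi` and `lam`, and the treatment of the anchor term as an opaque per-sample number. This is not the authors' implementation.

```python
import math

def freshness(lag: int, drift: float) -> float:
    """Score exp(-drift) / (lag + 1); equals 1.0 for a fully fresh sample.

    `lag` (assumed: number of policy updates since the rollout) and
    `drift` (assumed: a non-negative divergence estimate) are hypothetical inputs.
    """
    if lag < 0 or drift < 0:
        raise ValueError("lag and drift must be non-negative")
    return math.exp(-drift) / (lag + 1)

def gate(f: float, xi: float = 0.3) -> float:
    """Soft gate sigma(f - xi): exactly 0.5 at the threshold xi, increasing in f."""
    return 1.0 / (1.0 + math.exp(-(f - xi)))

def gated_objective(samples, xi: float = 0.3, lam: float = 0.1) -> float:
    """Mean of gate(f_i) * (loss_i + lam * anchor_i) over a buffer.

    `samples` is an iterable of (lag, drift, loss, anchor) tuples.
    """
    samples = list(samples)
    total = 0.0
    for lag, drift, loss, anchor in samples:
        total += gate(freshness(lag, drift), xi) * (loss + lam * anchor)
    return total / len(samples)
```

Under these assumptions a sample with zero lag and zero drift gets the largest gate value, and any increase in either quantity shrinks its contribution smoothly rather than discarding it outright.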

If this is right

Asynchronous execution becomes viable for on-policy distillation without large performance penalties on extended interaction horizons.
Policy drift can be controlled by weighting samples according to measured staleness rather than by enforcing strict synchronization.
The same freshness signal can be applied across reasoning, tool-use, and coding-agent tasks as horizon length increases.
Throughput gains from asynchrony are preserved while task success rates remain comparable to fully synchronous baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The freshness mechanism may transfer to other asynchronous reinforcement-learning or distillation pipelines that suffer from rollout or supervision staleness.
Combining the score with existing replay-buffer techniques could further reduce the synchronization overhead in distributed agent training.
If the decomposition into rollout and supervision drift proves general, similar freshness controls could stabilize longer-horizon post-training without additional compute.
The approach suggests that sample reliability signals, rather than global synchronization, may become the default way to manage drift in large-scale agentic training.

Load-bearing premise

The sample-level freshness score reliably quantifies deviation from the ideal on-policy objective and adaptively regulating its influence is sufficient to constrain accumulated policy drift under asynchronous execution.

What would settle it

Running f-OPD on the same long-horizon tasks without the freshness-based regulation and observing whether performance collapses to the level of naive asynchronous distillation, or running it with regulation and seeing whether performance still falls substantially below synchronous optimization, would settle the central claim.

Figures

Figures reproduced from arXiv: 2605.17862 by Jibin Wu, Shimin Zhang, Xianwei Chen.

**Figure 2.** Systematic overview of f-OPD. Top: three sample-level diagnostics used to characterize staleness. Bottom: the overall f-OPD pipeline, where sample freshness is estimated from these diagnostics and integrated into OPD optimization through three complementary mechanisms.

**Figure 3.** Failure modes of vanilla OPD under increasing policy update lag. (a) Relative task …

**Figure 4.** (a–c) Training dynamics across tasks for synchronous OPD, asynchronous OPD, and …

**Figure 5.** Supplementary entropy dynamics under fixed lag values. Unlike Figure 3(c), which reports …

**Figure 6.** Supplementary behavioral failure analysis for the long-horizon coding-agent setting. Bars …

Original abstract

Scaling on-policy distillation (OPD) for large language models (LLMs) confronts a fundamental tension: asynchronous execution is necessary for system efficiency, but structurally deviates from the ideal on-policy objective. To address this challenge, we theoretically decompose the objective discrepancy into rollout drift and supervision drift, capturing staleness in student rollout and teacher context, respectively. Building on this, we introduce a sample-level freshness score that quantifies the reliability of a buffered sample with respect to the on-policy objective. Guided by this signal, we further propose f-OPD, a novel framework that adaptively regulates stale-sample influence and constrains policy drift accumulated under asynchronous training. Across reasoning, tool-use, and coding-agent tasks of increasing interaction horizon, f-OPD consistently achieves task performance comparable to synchronous optimization while largely retaining the throughput advantages of asynchronous execution. Our results establish the first recipe for achieving a performance-efficiency trade-off in OPD, paving the way for long-horizon agentic post-training at scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

f-OPD gives a workable freshness-based regulator that lets async on-policy distillation hold performance on long-horizon agent tasks while keeping most of the speed.

The main point is that this paper supplies a concrete control for asynchronous on-policy distillation in LLM agents. They split the objective gap into rollout drift from stale student samples and supervision drift from the teacher, then build a per-sample freshness score to down-weight unreliable buffers and limit accumulated policy shift. That framing and the resulting f-OPD rule look like the actual new piece, and the experiments claim it matches synchronous performance on reasoning, tool-use, and coding tasks as interaction length increases while preserving async throughput gains. If the numbers and controls check out, that is a practical result for anyone trying to scale agent post-training without paying the full synchronous cost. The paper does a decent job laying out why async is needed for efficiency and showing results across horizons that matter for real agent workloads.

The soft spots are in the supporting detail. The abstract and high-level description do not include the exact freshness formula or a derivation showing it actually bounds the drift, so it is still unclear how sensitive the method is to the choice of score or how much drift is reduced in practice. The results are reported on specific task families; without seeing the full ablations or failure cases it is hard to know whether the regulation generalizes or mainly fits these setups. The central assumption that the freshness signal is a reliable proxy for on-policy quality is plausible but not yet stress-tested in the visible material.

This work is aimed at people building and training long-horizon LLM agents who already run distillation pipelines and care about throughput. A reader focused on practical scaling would find the recipe useful even if they end up tweaking the details.

It is coherent enough and addresses a real deployment tension, so it deserves a serious referee rather than a desk reject, though any review should press for the missing equations and more targeted ablations before acceptance.

Referee Report

2 major / 2 minor

Summary. The paper claims to resolve the efficiency-performance tension in scaling on-policy distillation (OPD) for LLMs by theoretically decomposing the objective discrepancy into rollout drift (staleness in student rollout) and supervision drift (staleness in teacher context). It introduces a sample-level freshness score to quantify buffered-sample reliability relative to the on-policy objective and proposes the f-OPD framework that adaptively regulates stale-sample influence to constrain accumulated policy drift under asynchronous execution. Empirical results across reasoning, tool-use, and coding-agent tasks of increasing interaction horizon show task performance comparable to synchronous optimization while largely retaining asynchronous throughput advantages.

Significance. If the decomposition and freshness-aware regulation prove effective at controlling policy drift without introducing new biases, the work would offer a practical and principled recipe for the performance-efficiency trade-off in long-horizon OPD. This could meaningfully advance scalable agentic post-training by allowing asynchronous execution without sacrificing on-policy fidelity, with the multi-task empirical validation providing initial evidence of generality.

major comments (2)

[§3.1] (theoretical decomposition): The central claim that the objective discrepancy decomposes cleanly into rollout drift and supervision drift is load-bearing for the entire f-OPD construction; the manuscript should provide the explicit derivation (including any assumptions on the policy update and buffer dynamics) to confirm that the two terms are exhaustive and non-overlapping.
[§4.2] (freshness score definition): The sample-level freshness score is presented as reliably quantifying deviation from the ideal on-policy objective, yet the weakest assumption in the work is that adaptively regulating its influence is sufficient to bound accumulated drift; an ablation or sensitivity analysis showing how performance degrades when the score is replaced by a simpler heuristic (e.g., age only) would strengthen this claim.

minor comments (2)

[Abstract] The abstract states the performance claims without any quantitative numbers or confidence intervals; adding a brief summary of the key metrics (e.g., success rate deltas and throughput ratios) would improve readability.
[§3 and §5] Notation for the freshness score and the two drift terms should be introduced once in §3 and used consistently thereafter; occasional redefinition in the experimental section creates unnecessary ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We address each major comment below and will update the manuscript accordingly.

Referee: [§3.1] (theoretical decomposition): The central claim that the objective discrepancy decomposes cleanly into rollout drift and supervision drift is load-bearing for the entire f-OPD construction; the manuscript should provide the explicit derivation (including any assumptions on the policy update and buffer dynamics) to confirm that the two terms are exhaustive and non-overlapping.

Authors: We agree that an explicit derivation strengthens the foundation of the decomposition. In the revised manuscript we will expand §3.1 with the full step-by-step derivation, stating the assumptions on policy updates and buffer dynamics, and showing that rollout drift and supervision drift are exhaustive and non-overlapping under those conditions.
revision: yes

Referee: [§4.2] (freshness score definition): The sample-level freshness score is presented as reliably quantifying deviation from the ideal on-policy objective, yet the weakest assumption in the work is that adaptively regulating its influence is sufficient to bound accumulated drift; an ablation or sensitivity analysis showing how performance degrades when the score is replaced by a simpler heuristic (e.g., age only) would strengthen this claim.

Authors: We acknowledge that an ablation against a simpler baseline such as age-only regulation would provide useful supporting evidence. We will add this ablation to the revised version, reporting performance when the freshness score is replaced by sample age across the reasoning, tool-use, and coding-agent tasks.
revision: yes
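
To make the age-only contrast concrete, here is a toy comparison of how the two weightings could rank the same buffer. The age-only weight $1/(\tau + 1)$ and the drift-aware weight $\exp(-\text{drift})/(\tau + 1)$ are illustrative stand-ins; the buffer entries and drift values are invented inputs, not data from the paper, and every name is hypothetical.

```python
import math

def age_only_weight(sample: dict) -> float:
    """Weight that depends only on how many updates old the sample is."""
    return 1.0 / (sample["lag"] + 1)

def drift_aware_weight(sample: dict) -> float:
    """Weight that also decays with an (assumed) measured drift estimate."""
    return math.exp(-sample["drift"]) / (sample["lag"] + 1)

def rank(buffer: list, weight_fn) -> list:
    """Return sample ids ordered from most to least influential."""
    return [s["id"] for s in sorted(buffer, key=weight_fn, reverse=True)]

# Made-up buffer: "a" is young but far off-policy, "b" is older but close.
buffer = [
    {"id": "a", "lag": 1, "drift": 2.0},
    {"id": "b", "lag": 3, "drift": 0.0},
]

age_ranking = rank(buffer, age_only_weight)
drift_ranking = rank(buffer, drift_aware_weight)
```

In this constructed case the two rules disagree about which sample to trust more, which is the kind of divergence an age-only ablation would need to exercise for the comparison to be informative.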

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The abstract and available description present a theoretical decomposition of objective discrepancy into rollout drift and supervision drift as an independent analytical step, followed by the definition of a sample-level freshness score and the proposal of the f-OPD framework for adaptive regulation. No equations, fitted parameters, or self-citations are exhibited that reduce any claimed prediction or result back to its own inputs by construction. The performance claims rest on empirical comparisons across tasks rather than on a self-referential loop, rendering the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unverified assumption that the proposed freshness score can be computed from observable quantities and that its use directly controls policy drift; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption: The objective discrepancy in asynchronous on-policy distillation can be decomposed into rollout drift and supervision drift.
This decomposition is presented as the theoretical starting point for the freshness score.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear

We theoretically decompose the objective discrepancy into rollout drift and supervision drift... introduce a sample-level freshness score $f_i = \frac{1}{\tau_i + 1}\exp(-\tilde{\Delta}^{t}_{i})$ ... $J(\pi^{t}_{\theta}) = \mathbb{E}\left[\sigma(f_i - \xi)\,(\ell^{t}_{i} + \lambda R^{\text{anchor}}_{i})\right]$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear

Proposition 3.2 (Two-fold Decomposition... $\Delta^{t} \le C_{\text{roll}}\,\mathrm{TV}(d^{t}, d^{\text{stale},t}) + C_{\text{sup}}\,\mathbb{E}\left[\mathrm{TV}\left(\pi_{\text{teacher}}(c^{t}_{i}), \pi_{\text{teacher}}(c^{r(i)}_{i})\right)\right]$
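
The right-hand side of the quoted bound can be evaluated numerically on small discrete distributions. The sketch below assumes that $\mathrm{TV}$ is total variation distance, that the expectation is a plain average over sampled context pairs, and that the constants $C_{\text{roll}}$ and $C_{\text{sup}}$ are supplied by the caller. The function names and the example inputs are hypothetical.

```python
def tv_distance(p, q) -> float:
    """Total variation distance between two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same support size")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def drift_bound(d_now, d_stale, teacher_pairs, c_roll=1.0, c_sup=1.0) -> float:
    """Evaluate C_roll * TV(d_now, d_stale) + C_sup * mean TV over teacher pairs.

    `teacher_pairs` holds (current-context, stale-context) teacher distributions.
    """
    rollout_term = tv_distance(d_now, d_stale)
    supervision_term = sum(tv_distance(p, q) for p, q in teacher_pairs) / len(teacher_pairs)
    return c_roll * rollout_term + c_sup * supervision_term

# Invented example: mild rollout shift, one of two teacher contexts fully changed.
bound = drift_bound(
    d_now=[0.5, 0.5],
    d_stale=[0.75, 0.25],
    teacher_pairs=[([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])],
)
```

Separating the two terms this way mirrors the rollout/supervision split: either term can be zero while the other is large, so a control signal that tracks only one of them would miss part of the bound.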

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 26 internal anchors

[1]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[3]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Deepseek-v4: Towards highly efficient million-token context intelligence

DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence. https://hu ggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf , 2026. technical report

work page 2026
[6]

MiMo-V2-Flash Technical Report

Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, et al. Mimo-v2-flash technical report.arXiv preprint arXiv:2601.02780, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[7]

GLM-5: from Vibe Coding to Agentic Engineering

Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, et al. Glm-5: from vibe coding to agentic engineering.arXiv preprint arXiv:2602.15763, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[8]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv:2409.19256, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

slime: An llm post-training framework for rl scaling.https://github.com/THUDM/slime, 2025

Zilin Zhu, Chengxing Xie, Xin Lv, and slime Contributors. slime: An llm post-training framework for rl scaling.https://github.com/THUDM/slime, 2025. GitHub repository, accessed 2026-05-06

work page 2025
[10]

Nemo rl: A scalable and efficient post-training library.https://github.com/NVIDIA-NeMo/RL,

Nvidia. Nemo rl: A scalable and efficient post-training library.https://github.com/NVIDIA-NeMo/RL,

work page
[11]

Scheduled sampling for sequence prediction with recurrent neural networks.Advances in neural information processing systems, 28, 2015

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks.Advances in neural information processing systems, 28, 2015

work page 2015
[12]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models.arXiv preprint arXiv:2505.22617, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Relax: Reasoning with latent exploration for large reasoning models.arXiv preprint arXiv:2512.07558, 2025

Shimin Zhang, Xianwei Chen, Yufan Shen, Ziyuan Ye, and Jibin Wu. Relax: Reasoning with latent exploration for large reasoning models.arXiv preprint arXiv:2512.07558, 2025

work page arXiv 2025
[15]

A Survey of On-Policy Distillation for Large Language Models

Mingyang Song and Mao Zheng. A survey of on-policy distillation for large language models.arXiv preprint arXiv:2604.00626, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[16]

MiniLLM: On-policy distillation of large language models

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: On-policy distillation of large language models. InInternational Conference on Learning Representations, 2024

work page 2024
[17]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In International Conference on Learning Representations, 2024. 10

work page 2024
[18]

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan- ang Gao, Wenkai Yang, Zhiyuan Liu, et al. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe.arXiv preprint arXiv:2604.13016, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[19]

Reinforcement Learning via Self-Distillation

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[20]

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self- distilled reasoner: On-policy self-distillation for large language models.arXiv preprint arXiv:2601.18734, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[21]

OPSDL: On-Policy Self-Distillation for Long-Context Language Models

Xinsen Zhang, Zhenkai Ding, Tianjun Pan, Run Yang, Chun Kang, Xue Xiong, and Jingnan Gu. Opsdl: On-policy self-distillation for long-context language models.arXiv preprint arXiv:2604.17535, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[22]

CRISP: Compressed Reasoning via Iterative Self-Policy Distillation

Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, and Jiachen Sun. Crisp: Compressed reasoning via iterative self-policy distillation.arXiv preprint arXiv:2603.05433, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[23]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E

Jongwoo Ko, Sara Abdali, Young Jin Kim, Tianyi Chen, and Pashmina Cameron. Scaling reasoning efficiently via relaxed on-policy distillation.arXiv preprint arXiv:2603.11137, 2026

work page arXiv 2026
[24]

On-Policy Context Distillation for Language Models

Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models.arXiv preprint arXiv:2602.12275, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, and Yankai Lin. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation.arXiv preprint arXiv:2602.12125, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[26]

Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, and Dongbin Zhao. Revisiting on-policy distillation: Empirical failure modes and simple fixes.arXiv preprint arXiv:2603.25562, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[27]

Black-box on-policy distillation of large language models.arXiv preprint, arXiv:2511.10643, 2025

Tianzhu Ye, Li Dong, Zewen Chi, Xun Wu, Shaohan Huang, and Furu Wei. Black-box on-policy distillation of large language models.arXiv preprint arXiv:2511.10643, 2025

work page arXiv 2025
[28]

Entropy-aware on-policy distillation of language models

Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models.arXiv preprint arXiv:2603.07079, 2026. Also available on OpenReview as SPOT 2026

work page arXiv 2026
[29]

Fast and effective on-policy distillation from reasoning prefixes.arXiv preprint arXiv:2602.15260, 2026

Dongxu Zhang, Zhichao Yang, Sepehr Janghorbani, Jun Han, Andrew Ressler II, Qian Qian, Gregory D. Lyng, Sanjit Singh Batra, and Robert E. Tillman. Fast and effective on-policy distillation from reasoning prefixes.arXiv preprint arXiv:2602.15260, 2026

work page arXiv 2026
[30]

Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu

V olodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, 2016

work page 2016
[31]

IMPALA: Scalable distributed deep-rl with importance weighted actor-learner architectures

Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, V olodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: Scalable distributed deep-rl with importance weighted actor-learner architectures. InProceedings of the 35th International Conference on Machine Learning, 2018

work page 2018
[32]

Staleness-aware Async-SGD for distributed deep learning

Wei Zhang, Suyog Gupta, Xiangru Lian, and Ji Liu. Staleness-aware Async-SGD for distributed deep learning. InProceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, 2016

work page 2016
[33]

PipeDream: Fast and efficient pipeline parallel DNN training

Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. PipeDream: Fast and efficient pipeline parallel DNN training. InProceedings of the 27th ACM Symposium on Operating Systems Principles, 2019

work page 2019
[34]

Yaosheng Xu, Dailin Hu, Litian Liang, Stephen Marcus McAleer, Pieter Abbeel, and Roy Fox

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning.Machine Learning, 8(3-4):229–256, 1992. doi: 10.1007/BF00992696

work page doi:10.1007/bf00992696 1992
[35]

When speed kills stability: Demystifying RL collapse from the training-inference mismatch

Jiacai Liu, Yingru Li, Yuqian Fu, Jiawei Wang, Qian Liu, and Yu Shen. When speed kills stability: Demystifying RL collapse from the training-inference mismatch. https://richardli.xyz/rl-colla pse, September 2025. Research blog post, accessed 2026-05-06

work page 2025
[36]

Gordon, and J

Stephane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. InProceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011. 11

work page 2011
[37]

Jordan, and Pieter Abbeel

John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. InProceedings of the 32nd International Conference on Machine Learning, 2015

work page 2015
[38]

Trust region masking for long-horizon LLM reinforcement learning.arXiv preprint arXiv:2512.23075, 2025

Yingru Li, Jiacai Liu, Jiawei Xu, Yuxuan Tong, Ziniu Li, Qian Liu, and Baoxiang Wang. Trust region masking for long-horizon LLM reinforcement learning.arXiv preprint arXiv:2512.23075, 2025

work page arXiv 2025
[39]

Batch size-invariance for policy optimization.Advances in Neural Information Processing Systems, 35:17086–17098, 2022

Jacob Hilton, Karl Cobbe, and John Schulman. Batch size-invariance for policy optimization.Advances in Neural Information Processing Systems, 35:17086–17098, 2022. Introduces decoupled PPO by separating the proximal policy for update control from the behavior policy for off-policy correction

work page 2022
[40]

On-policy distillation

Kevin Lu and Thinking Machines Lab. On-policy distillation. https://thinkingmachines.ai/blog /on-policy-distillation/, 2025. Thinking Machines Lab blog post, published 2025-10-27, accessed 2026-05-06

work page 2025
[41]

DAPO-Math-17k

BytedTsinghua-SIA. DAPO-Math-17k. https://huggingface.co/datasets/BytedTsinghua-SIA /DAPO-Math-17k, 2025. Hugging Face dataset repository, accessed 2026-05-06

work page 2025
[42]

ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. ReTool: Reinforcement learning for strategic tool use in LLMs.arXiv preprint arXiv:2504.11536, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R. Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. InAdvances in Neural Information Processing Systems, 2024. URL https://arxiv.org/abs/2405.15793 . Recommended citation for mini-SWE-agent from the project repository

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

Measuring mathematical problem solving with the MATH dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. InProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021

work page 2021
[45]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? InInternational Conference on Learning Representations, 2024

work page 2024
[46]

Training Software Engineering Agents and Verifiers with SWE-Gym

Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with SWE-Gym. InProceedings of the 42nd International Conference on Machine Learning, 2025. arXiv:2412.21139

work page internal anchor Pith review Pith/arXiv arXiv 2025
[47]

SkyRL-Agent: Efficient RL training for multi-turn LLM agent

Shiyi Cao, Dacheng Li, Fangzhou Zhao, Shuo Yuan, Sumanth R. Hegde, Connor Chen, Charlie Ruan, Tyler Griggs, Shu Liu, Eric Tang, Richard Liaw, Philipp Moritz, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. SkyRL-Agent: Efficient RL training for multi-turn LLM agent. arXiv preprint arXiv:2511.16108, 2025

[48]

DeepSWE: Training a fully open-sourced, state-of-the-art coding agent by scaling RL

Michael Luo, Naman Jain, Jaskirat Singh, Sijun Tan, Colin Cai, Tarun Venkat, Manan Roongta, Li Erran Li, Raluca Ada Popa, Koushik Sen, Ion Stoica, Ameen Patel, Qingyang Wu, Alpay Ariyak, Shang Zhu, Ben Athiwaratkun, and Ce Zhang. DeepSWE: Training a fully open-sourced, state-of-the-art coding agent by scaling RL. https://pretty-radio-b75.notion.site/DeepS...

[49]

Qwen3-Coder-Next Technical Report

Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng, Binyuan Hui, Yuheng Jing, Kaixin Li, Mingze Li, Junyang Lin, et al. Qwen3-Coder-Next technical report. arXiv preprint arXiv:2603.00729, 2026

[50]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

[51]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

[52]

AgentBench: Evaluating LLMs as agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as agents. In International Conference on Learning Representations, 2024

[53]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2024

[54]

VisualWebArena: Evaluating multimodal agents on realistic visual web tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024

[55]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi T...

[56]

SWE-smith: Scaling data for software engineering agents

John Yang, Kilian Lieret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. SWE-smith: Scaling data for software engineering agents. In Advances in Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=63iVrXc8cC. Datasets and Benchmarks Track Spotlight. 13...

[57]

rollout drift and supervision drift are distinct nonnegative mismatch channels

[58]

larger values of those diagnostics increase potential objective discrepancy under the stated assumptions, while larger lag enlarges the budget over which rollout drift may accumulate; and

[59]

mapping those signals through a monotone freshness transformation suppresses higher-risk samples. What the theory does not claim is that α and β are universal constants or that the chosen surrogate is uniquely optimal. In practice, they should be understood as domain-dependent calibration parameters that align observable mismatch diagnostics onto a common o...
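The role of α and β in the passage above can be illustrated with a minimal sketch. It assumes an exponential surrogate, exp(-(α · rollout drift + β · supervision drift)), which is only one possible monotone transformation and is not taken from the paper. The function name `freshness_score`, its arguments, the default values, and the sample numbers are all hypothetical.

```python
import math


def freshness_score(rollout_drift: float, supervision_drift: float,
                    alpha: float = 1.0, beta: float = 1.0) -> float:
    """Map two nonnegative mismatch diagnostics to a weight in (0, 1].

    Assumption: an exponential surrogate. Any monotone decreasing map
    would serve the same illustrative purpose.
    """
    if rollout_drift < 0 or supervision_drift < 0:
        raise ValueError("drift diagnostics are assumed nonnegative")
    if alpha < 0 or beta < 0:
        raise ValueError("calibration parameters are assumed nonnegative")
    # alpha and beta place the two diagnostics on a common scale.
    risk = alpha * rollout_drift + beta * supervision_drift
    # Higher combined risk yields a smaller weight.
    return math.exp(-risk)


# Hypothetical buffered samples: (rollout_drift, supervision_drift).
samples = [(0.0, 0.0), (0.1, 0.05), (0.8, 0.4)]
weights = [freshness_score(r, s, alpha=2.0, beta=1.0) for r, s in samples]
```

Under these assumptions, a sample with zero drift keeps full weight and the weight shrinks as either diagnostic grows. Because α and β are described as domain-dependent, values such as `alpha=2.0` would need to be calibrated per task rather than reused.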
