
arxiv: 2605.17862 · v1 · pith:Q3QDWVBTnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI

f-OPD: Stabilizing Long-Horizon On-Policy Distillation with Freshness-Aware Control

Pith reviewed 2026-05-20 12:23 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords on-policy distillation · asynchronous training · freshness score · policy drift · LLM agents · long-horizon tasks · reasoning · tool use

The pith

A freshness score lets asynchronous on-policy distillation match synchronous performance on long-horizon agent tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper confronts the efficiency-performance tension in scaling on-policy distillation for large language models, where asynchronous execution is required for throughput but creates objective deviations through stale samples. It decomposes these deviations into rollout drift from the student and supervision drift from the teacher, then defines a sample-level freshness score to measure each buffered sample's reliability relative to the ideal on-policy objective. The f-OPD framework uses this score to adaptively limit the influence of stale samples and thereby bound accumulated policy drift. A sympathetic reader would care because the result shows that long-horizon reasoning, tool-use, and coding-agent tasks can be trained at scale without forcing a choice between speed and correctness.

Core claim

f-OPD stabilizes long-horizon on-policy distillation by theoretically decomposing the objective discrepancy into rollout drift and supervision drift, introducing a sample-level freshness score that quantifies the reliability of buffered samples with respect to the on-policy objective, and adaptively regulating stale-sample influence to constrain accumulated policy drift under asynchronous execution, thereby achieving task performance comparable to synchronous optimization while largely retaining the throughput advantages of asynchronous execution.

What carries the argument

The sample-level freshness score, which quantifies how much a buffered sample deviates from the ideal on-policy objective and guides adaptive regulation of its influence within the f-OPD framework.
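
As a rough illustration of how such a score could regulate sample influence, the sketch below combines two hypothetical per-sample diagnostics (`rollout_drift`, `supervision_drift`) into a freshness value in (0, 1] and uses it to weight a per-sample distillation loss. The exponential map, the weights `alpha` and `beta`, and all function names are assumptions made for illustration; the paper's actual score and regulation mechanisms are not reproduced on this page.

```python
import math


def freshness(rollout_drift: float, supervision_drift: float,
              alpha: float = 1.0, beta: float = 1.0) -> float:
    """Map two nonnegative drift diagnostics to a score in (0, 1].

    Assumption: a monotone decreasing exponential map. alpha and beta are
    illustrative calibration weights, not values taken from the paper.
    """
    if rollout_drift < 0 or supervision_drift < 0:
        raise ValueError("drift diagnostics must be nonnegative")
    return math.exp(-(alpha * rollout_drift + beta * supervision_drift))


def freshness_weighted_loss(losses, rollout_drifts, supervision_drifts,
                            alpha: float = 1.0, beta: float = 1.0) -> float:
    """Self-normalised weighted mean of per-sample losses.

    Samples with larger measured drift receive smaller weights, so stale
    samples contribute less to the update.
    """
    weights = [freshness(r, s, alpha, beta)
               for r, s in zip(rollout_drifts, supervision_drifts)]
    total = sum(weights)
    if total == 0.0:
        # Empty batch, or every weight underflowed to zero: no usable signal.
        return 0.0
    return sum(w * loss for w, loss in zip(weights, losses)) / total


if __name__ == "__main__":
    # Two buffered samples: the second was produced by a much older policy.
    print(freshness_weighted_loss([1.0, 3.0], [0.0, 2.0], [0.0, 0.5]))
```

Normalising by the sum of weights keeps the loss scale comparable across batches with different staleness profiles; a hard threshold or a clipped weight would be an equally plausible design under the same assumptions.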

If this is right

  • Asynchronous execution becomes viable for on-policy distillation without large performance penalties on extended interaction horizons.
  • Policy drift can be controlled by weighting samples according to measured staleness rather than by enforcing strict synchronization.
  • The same freshness signal can be applied across reasoning, tool-use, and coding-agent tasks as horizon length increases.
  • Throughput gains from asynchrony are preserved while task success rates remain comparable to fully synchronous baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The freshness mechanism may transfer to other asynchronous reinforcement-learning or distillation pipelines that suffer from rollout or supervision staleness.
  • Combining the score with existing replay-buffer techniques could further reduce the synchronization overhead in distributed agent training.
  • If the decomposition into rollout and supervision drift proves general, similar freshness controls could stabilize longer-horizon post-training without additional compute.
  • The approach suggests that sample reliability signals, rather than global synchronization, may become the default way to manage drift in large-scale agentic training.

Load-bearing premise

The sample-level freshness score reliably quantifies deviation from the ideal on-policy objective and adaptively regulating its influence is sufficient to constrain accumulated policy drift under asynchronous execution.

What would settle it

Either of two experiments would settle the central claim. The first runs f-OPD on the same long-horizon tasks without the freshness-based regulation and checks whether performance collapses to the level of naive asynchronous distillation. The second runs f-OPD with regulation and checks whether performance still falls substantially below synchronous optimization.

Figures

Figures reproduced from arXiv: 2605.17862 by Jibin Wu, Shimin Zhang, Xianwei Chen.

Figure 1: (a) System implementation of OPD under synchronous (top) and asynchronous (bottom) …
Figure 2: Systematic overview of f-OPD. Top: three sample-level diagnostics used to characterize staleness. Bottom: the overall f-OPD pipeline, where sample freshness is estimated from these diagnostics and integrated into OPD optimization through three complementary mechanisms.
Figure 3: Failure modes of vanilla OPD under increasing policy update lag. (a) Relative task …
Figure 4: (a–c) Training dynamics across tasks for synchronous OPD, asynchronous OPD, and …
Figure 5: Supplementary entropy dynamics under fixed lag values. Unlike Figure 3(c), which reports …
Figure 6: Supplementary behavioral failure analysis for the long-horizon coding-agent setting. Bars …
Original abstract

Scaling on-policy distillation (OPD) for large language models (LLMs) confronts a fundamental tension: asynchronous execution is necessary for system efficiency, but structurally deviates from the ideal on-policy objective. To address this challenge, we theoretically decompose the objective discrepancy into rollout drift and supervision drift, capturing staleness in student rollout and teacher context, respectively. Building on this, we introduce a sample-level freshness score that quantifies the reliability of a buffered sample with respect to the on-policy objective. Guided by this signal, we further propose f-OPD, a novel framework that adaptively regulates stale-sample influence and constrains policy drift accumulated under asynchronous training. Across reasoning, tool-use, and coding-agent tasks of increasing interaction horizon, f-OPD consistently achieves task performance comparable to synchronous optimization while largely retaining the throughput advantages of asynchronous execution. Our results establish the first recipe for achieving a performance-efficiency trade-off in OPD, paving the way for long-horizon agentic post-training at scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to resolve the efficiency-performance tension in scaling on-policy distillation (OPD) for LLMs by theoretically decomposing the objective discrepancy into rollout drift (staleness in student rollout) and supervision drift (staleness in teacher context). It introduces a sample-level freshness score to quantify buffered-sample reliability relative to the on-policy objective and proposes the f-OPD framework that adaptively regulates stale-sample influence to constrain accumulated policy drift under asynchronous execution. Empirical results across reasoning, tool-use, and coding-agent tasks of increasing interaction horizon show task performance comparable to synchronous optimization while largely retaining asynchronous throughput advantages.

Significance. If the decomposition and freshness-aware regulation prove effective at controlling policy drift without introducing new biases, the work would offer a practical and principled recipe for the performance-efficiency trade-off in long-horizon OPD. This could meaningfully advance scalable agentic post-training by allowing asynchronous execution without sacrificing on-policy fidelity, with the multi-task empirical validation providing initial evidence of generality.

major comments (2)
  1. [§3.1, theoretical decomposition] The central claim that the objective discrepancy decomposes cleanly into rollout drift and supervision drift is load-bearing for the entire f-OPD construction; the manuscript should provide the explicit derivation (including any assumptions on the policy update and buffer dynamics) to confirm that the two terms are exhaustive and non-overlapping.
  2. [§4.2, freshness score definition] The sample-level freshness score is presented as reliably quantifying deviation from the ideal on-policy objective, yet the weakest assumption in the work is that adaptively regulating its influence is sufficient to bound accumulated drift; an ablation or sensitivity analysis showing how performance degrades when the score is replaced by a simpler heuristic (e.g., age only) would strengthen this claim.
minor comments (2)
  1. [Abstract] The abstract states the performance claims without any quantitative numbers or confidence intervals; adding a brief summary of the key metrics (e.g., success rate deltas and throughput ratios) would improve readability.
  2. [§3 and §5] Notation for the freshness score and the two drift terms should be introduced once in §3 and used consistently thereafter; occasional redefinition in the experimental section creates unnecessary ambiguity.
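
The ablation requested in the second major comment can be pictured with a toy comparison. The sketch below contrasts an age-only weight with a weight driven by a measured drift diagnostic; both functions, the `half_life` and `scale` parameters, and the sample fields are hypothetical and do not come from the paper.

```python
def age_only_weight(age_in_updates: int, half_life: float = 4.0) -> float:
    """Heuristic baseline: the weight halves every `half_life` policy updates."""
    return 0.5 ** (age_in_updates / half_life)


def drift_based_weight(measured_drift: float, scale: float = 1.0) -> float:
    """Illustrative alternative: the weight shrinks with a measured drift value."""
    return 1.0 / (1.0 + scale * measured_drift)


# Two buffered samples of equal age whose measured drift differs.
buffer = [
    {"age": 3, "drift": 0.1},
    {"age": 3, "drift": 2.0},
]

age_weights = [age_only_weight(s["age"]) for s in buffer]
drift_weights = [drift_based_weight(s["drift"]) for s in buffer]

# The age-only rule cannot tell the two samples apart; the drift-based rule can.
print(age_weights)
print(drift_weights)
```

If an age-only rule performed as well as the full score in practice, that would suggest the extra diagnostics add little; this is the question the ablation is meant to answer.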

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We address each major comment below and will update the manuscript accordingly.

Point-by-point responses
  1. Referee: [§3.1, theoretical decomposition] The central claim that the objective discrepancy decomposes cleanly into rollout drift and supervision drift is load-bearing for the entire f-OPD construction; the manuscript should provide the explicit derivation (including any assumptions on the policy update and buffer dynamics) to confirm that the two terms are exhaustive and non-overlapping.

    Authors: We agree that an explicit derivation strengthens the foundation of the decomposition. In the revised manuscript we will expand §3.1 with the full step-by-step derivation, stating the assumptions on policy updates and buffer dynamics, and showing that rollout drift and supervision drift are exhaustive and non-overlapping under those conditions.

    Revision: yes

  2. Referee: [§4.2, freshness score definition] The sample-level freshness score is presented as reliably quantifying deviation from the ideal on-policy objective, yet the weakest assumption in the work is that adaptively regulating its influence is sufficient to bound accumulated drift; an ablation or sensitivity analysis showing how performance degrades when the score is replaced by a simpler heuristic (e.g., age only) would strengthen this claim.

    Authors: We acknowledge that an ablation against a simpler baseline such as age-only regulation would provide useful supporting evidence. We will add this ablation to the revised version, reporting performance when the freshness score is replaced by sample age across the reasoning, tool-use, and coding-agent tasks.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

Full rationale

The abstract and available description present a theoretical decomposition of objective discrepancy into rollout drift and supervision drift as an independent analytical step, followed by the definition of a sample-level freshness score and the proposal of the f-OPD framework for adaptive regulation. No equations, fitted parameters, or self-citations are exhibited that reduce any claimed prediction or result back to its own inputs by construction. The performance claims rest on empirical comparisons across tasks rather than on a self-referential loop, rendering the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unverified assumption that the proposed freshness score can be computed from observable quantities and that its use directly controls policy drift; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: The objective discrepancy in asynchronous on-policy distillation can be decomposed into rollout drift and supervision drift.
    This decomposition is presented as the theoretical starting point for the freshness score.
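
One way to see why a two-term split of this kind can be exhaustive is an add-and-subtract identity. The notation is assumed for illustration and is not taken from the paper: $\mathcal{L}(\mu, c)$ denotes the distillation loss evaluated on rollouts from policy $\mu$ with teacher context $c$, $\pi_\theta$ is the current student, $\mu$ is the stale behaviour policy that produced the buffered rollouts, and $c$, $c'$ are the fresh and stale teacher contexts.

```latex
\begin{aligned}
\underbrace{\mathcal{L}(\mu, c') - \mathcal{L}(\pi_\theta, c)}_{\text{objective discrepancy}}
&= \underbrace{\mathcal{L}(\mu, c') - \mathcal{L}(\pi_\theta, c')}_{\text{rollout drift}}
 + \underbrace{\mathcal{L}(\pi_\theta, c') - \mathcal{L}(\pi_\theta, c)}_{\text{supervision drift}}, \\
\bigl|\mathcal{L}(\mu, c') - \mathcal{L}(\pi_\theta, c)\bigr|
&\le \bigl|\mathcal{L}(\mu, c') - \mathcal{L}(\pi_\theta, c')\bigr|
 + \bigl|\mathcal{L}(\pi_\theta, c') - \mathcal{L}(\pi_\theta, c)\bigr|.
\end{aligned}
```

The first line holds by construction, so the split is exhaustive in this toy form; whether the paper's terms take this shape, and under which assumptions, is what the referee's first major comment asks the authors to spell out.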

pith-pipeline@v0.9.0 · 5712 in / 1186 out tokens · 44454 ms · 2026-05-20T12:23:11.111793+00:00 · methodology



  57. [57]

    rollout drift and supervision drift are distinct nonnegative mismatch channels

  58. [58]

    larger values of those diagnostics increase potential objective discrepancy under the stated assumptions, while larger lag enlarges the budget over which rollout drift may accumulate; and

  59. [59]

    mapping those signals through a monotone freshness transformation suppresses higher-risk samples. What the theory does not claim is that α and β are universal constants or that the chosen surrogate is uniquely optimal. In practice, they should be understood as domain-dependent calibration parameters that align observable mismatch diagnostics onto a common o...