pith. machine review for the scientific record.

arxiv: 2505.24298 · v5 · submitted 2025-05-30 · 💻 cs.LG · cs.AI

Recognition: no theorem link

AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 14:19 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords asynchronous reinforcement learning · large language models · reasoning · PPO · training systems · GPU utilization · staleness

The pith

AReaL decouples generation from training in reinforcement learning to achieve up to 2.77 times faster training for language models on reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AReaL, an asynchronous RL system for training LLMs on reasoning. Synchronous systems waste GPU time waiting for the slowest rollout in a batch. AReaL lets rollout workers generate continuously and training workers update as data arrives. Workload balancing controls staleness while a modified PPO handles outdated samples. Experiments show the system delivers substantial speedups without losing performance on math and code benchmarks.
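As a minimal sketch of that producer-consumer split (illustrative only, not AReaL's actual code; get_weights, generate, and update_model are hypothetical stand-ins for the weight store, the inference engine, and one PPO step):

    import queue
    import threading

    BATCH_SIZE = 8
    trajectory_queue: "queue.Queue[dict]" = queue.Queue()

    def rollout_worker(get_weights, generate):
        """Generate continuously; never wait for a training step boundary."""
        while True:
            weights, version = get_weights()        # newest weights available
            trajectory = generate(weights)          # one complete rollout
            trajectory["policy_version"] = version  # recorded for staleness tracking
            trajectory_queue.put(trajectory)

    def training_worker(update_model):
        """Consume a batch as soon as one has accumulated, then update."""
        while True:
            batch = [trajectory_queue.get() for _ in range(BATCH_SIZE)]
            update_model(batch)  # e.g., one staleness-aware PPO step

    def launch(n_rollout_workers, get_weights, generate, update_model):
        for _ in range(n_rollout_workers):
            threading.Thread(target=rollout_worker,
                             args=(get_weights, generate), daemon=True).start()
        training_worker(update_model)

Neither loop blocks on the other's step boundary; the queue is the only coupling, which is what removes the wait on the longest output in a batch.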

Core claim

AReaL is a fully asynchronous RL system that completely decouples generation from training. Rollout workers continuously generate outputs while training workers update the model on collected batches, with workload balancing to control data staleness and a staleness-enhanced PPO variant for stability. The result is up to 2.77× training speedup over synchronous systems with the same number of GPUs, with matched or improved final performance on reasoning benchmarks.

What carries the argument

The asynchronous decoupling of rollout workers from training workers, combined with workload balancing and staleness-enhanced PPO, allowing continuous generation and model updates without batch-synchronization waits.

Load-bearing premise

That balancing rollout and training workloads plus the staleness-enhanced PPO variant can maintain training stability and effectiveness even with outdated samples.
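One plausible shape for such a variant, sketched here for concreteness and not necessarily the paper's exact formulation, follows the decoupled-PPO idea of Hilton et al. [10] in the reference graph below: importance-weight each stale sample by how far its behavior policy lags a recent proximal policy, and anchor the clipping at the proximal policy rather than at the behavior policy:

    \mathcal{J}(\theta) = \mathbb{E}_{(s,a) \sim \pi_{\text{behav}}}\!\left[ \frac{\pi_{\text{prox}}(a \mid s)}{\pi_{\text{behav}}(a \mid s)} \min\!\Big( r(\theta)\,\hat{A},\ \operatorname{clip}\big(r(\theta), 1-\epsilon, 1+\epsilon\big)\,\hat{A} \Big) \right], \qquad r(\theta) = \frac{\pi_\theta(a \mid s)}{\pi_{\text{prox}}(a \mid s)}

When \pi_{\text{prox}} = \pi_{\text{behav}} this reduces to standard PPO; when rollouts lag the trainer, the leading ratio reweights stale samples instead of letting them distort the clipped update.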

What would settle it

Running the same benchmarks on a synchronous system with identical GPUs and observing no speedup or a performance drop would falsify the efficiency claim.

Original abstract

Reinforcement learning (RL) has become a dominant paradigm for training large language models (LLMs), particularly for reasoning tasks. Effective RL for LLMs requires massive parallelization and poses an urgent need for efficient training systems. Most existing large-scale RL systems for LLMs are synchronous, alternating generation and training in a batch setting where rollouts in each training batch are generated by the same model. This approach stabilizes RL training but suffers from severe system-level inefficiency: generation must wait until the longest output in the batch is completed before model updates, resulting in GPU underutilization. We present AReaL, a fully asynchronous RL system that completely decouples generation from training. Rollout workers in AReaL continuously generate new outputs without waiting, while training workers update the model whenever a batch of data is collected. AReaL also incorporates a collection of system-level optimizations, leading to substantially higher GPU utilization. To stabilize RL training, AReaL balances the workload of rollout and training workers to control data staleness, and adopts a staleness-enhanced PPO variant to better handle outdated training samples. Extensive experiments on math and code reasoning benchmarks show that AReaL achieves up to 2.77$\times$ training speedup compared to synchronous systems with the same number of GPUs and matched or improved final performance. The code of AReaL is available at https://github.com/inclusionAI/AReaL/.
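The abstract's "balances the workload of rollout and training workers to control data staleness" admits a simple concrete reading: admit new rollouts only while every sample is guaranteed to be consumed within a bounded number of policy versions of its generation. A hypothetical sketch (class and parameter names are illustrative, not the paper's API):

    class RolloutController:
        """Admit new rollouts only while a maximum staleness bound can hold."""

        def __init__(self, batch_size: int, max_staleness: int):
            self.batch_size = batch_size
            self.max_staleness = max_staleness  # eta: allowed version lag
            self.policy_version = 0             # bumped on every trainer update
            self.samples_submitted = 0          # rollouts admitted so far

        def can_admit_rollout(self) -> bool:
            # Training consumes batch_size samples per version bump, so capping
            # submissions at (version + 1 + eta) * batch_size keeps each sample
            # within eta versions of the policy that eventually trains on it.
            cap = (self.policy_version + 1 + self.max_staleness) * self.batch_size
            return self.samples_submitted < cap

        def on_rollout_admitted(self) -> None:
            self.samples_submitted += 1

        def on_trainer_update(self) -> None:
            self.policy_version += 1

Setting max_staleness to zero recovers near-synchronous behavior; raising it trades freshness for GPU utilization, which is exactly the dial the staleness-enhanced PPO variant is meant to make safe.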

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper presents AReaL, a fully asynchronous RL system for LLM reasoning that decouples rollout generation from training. Rollout workers generate continuously while training workers update on collected batches; workload balancing controls staleness and a staleness-enhanced PPO variant is used to maintain stability. Experiments on math and code benchmarks report up to 2.77× training speedup versus synchronous baselines with matched or improved final performance.

Significance. If the empirical claims hold under broader validation, AReaL would demonstrate a practical path to higher GPU utilization in large-scale LLM RL without performance loss, addressing a central systems bottleneck as model sizes and reasoning tasks grow.

major comments (3)
  1. [§4] §4 (Staleness-enhanced PPO variant): The paper supplies no description of the precise algorithmic modifications (e.g., adjusted clipping thresholds, importance-sampling corrections, or advantage re-weighting by staleness age) that are load-bearing for the claim that training remains stable and effective with outdated samples.
  2. [§5] §5 (Experiments): No ablation is presented that removes the staleness enhancement while retaining asynchronous execution, nor are there plots or tables of performance versus measured data staleness; without these the parity claim cannot be isolated from the particular benchmarks or unstated hyper-parameter retuning. (A one-line sketch of the staleness measurement such plots require follows this report.)
  3. [§5] §5 (Experiments): The reported speedups and performance parity lack details on exact baseline implementations, number of random seeds, statistical variance, and full hyper-parameter choices, which are required to verify the central 2.77× claim and reproducibility.
minor comments (1)
  1. [Abstract] The abstract refers to 'a collection of system-level optimizations' without enumerating them; a brief list or pointer to the relevant subsection would improve clarity.
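On major comment 2: once each trajectory records the policy version that generated it (as in the sketch under "The pith"), the measurement the requested plots need is one line; names are hypothetical, not the paper's API.

    def sample_staleness(batch: list, current_version: int) -> list:
        # Version gap between the policy doing the update and the policy that
        # generated each trajectory; 0 is the on-policy, synchronous case.
        return [current_version - traj["policy_version"] for traj in batch]

Binning final benchmark accuracy by this quantity would produce the requested performance-versus-staleness curves.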

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help improve the clarity and reproducibility of our work. We address each major comment below and will revise the manuscript to incorporate the requested details and additional experiments.

Point-by-point responses
  1. Referee: [§4] §4 (Staleness-enhanced PPO variant): The paper supplies no description of the precise algorithmic modifications (e.g., adjusted clipping thresholds, importance-sampling corrections, or advantage re-weighting by staleness age), which is load-bearing for the claim that training remains stable and effective with outdated samples.

    Authors: We agree that §4 lacks a precise description of the modifications in the staleness-enhanced PPO variant. In the revised manuscript we will expand this section with the full algorithmic details, including any changes to clipping thresholds, importance-sampling corrections, and advantage re-weighting as a function of staleness age. These additions will directly support the stability claims for training with outdated samples. revision: yes

  2. Referee: [§5] §5 (Experiments): No ablation is presented that removes the staleness enhancement while retaining asynchronous execution, nor are there plots or tables of performance versus measured data staleness; without these the parity claim cannot be isolated from the particular benchmarks or unstated hyper-parameter retuning.

    Authors: We acknowledge that an ablation isolating the staleness enhancement (while retaining asynchronous execution) and plots/tables of performance versus measured staleness would strengthen the experimental section. We will add both in the revised manuscript: an ablation comparing the full AReaL system against an asynchronous baseline without the staleness enhancement, plus figures showing final performance and training curves as functions of average data staleness. These will help isolate the contribution of the enhancement from benchmark-specific effects or hyper-parameter choices. revision: yes

  3. Referee: [§5] §5 (Experiments): The reported speedups and performance parity lack details on exact baseline implementations, number of random seeds, statistical variance, and full hyper-parameter choices, which are required to verify the central 2.77× claim and reproducibility.

    Authors: We agree that additional implementation and statistical details are necessary for reproducibility. In the revised manuscript we will expand the experimental section with: (i) precise descriptions of the synchronous baseline implementations, (ii) the number of random seeds used for each result, (iii) statistical variance (standard deviations across seeds), and (iv) complete hyper-parameter tables for all methods and benchmarks. This will allow independent verification of the reported speedups and performance parity. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical systems results rest on direct runtime measurements

Full rationale

The paper describes an asynchronous RL training system (AReaL) whose core claims are measured speedups (up to 2.77×) and matched/improved benchmark performance on math and code tasks. These outcomes are obtained from end-to-end experiments that compare wall-clock training time and final accuracy against synchronous baselines under identical GPU counts. No mathematical derivation chain, fitted-parameter prediction, or load-bearing self-cited uniqueness theorem is present; the staleness-handling mechanisms are presented as engineering choices whose effectiveness is validated by the same runtime data rather than by construction or tautology. The evaluation is therefore externally falsifiable by re-running the open-source code on the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The contribution is primarily architectural and empirical; no new mathematical axioms, free parameters fitted inside a derivation, or invented physical entities are introduced.

pith-pipeline@v0.9.0 · 5602 in / 1023 out tokens · 41166 ms · 2026-05-15T14:19:43.363131+00:00 · methodology


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AIS: Adaptive Importance Sampling for Quantized RL

    stat.ML 2026-05 unverdicted novelty 7.0

    AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.

  2. BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preservin...

  3. Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning

    cs.CL 2026-04 unverdicted novelty 7.0

    Freshness-Aware PER augments prioritized experience replay with exponential age decay based on effective sample size to enable successful reuse of trajectories in LLM and VLM reinforcement learning, outperforming on-p...

  4. Diagnosing Training Inference Mismatch in LLM Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Training-inference mismatch in separated rollout and optimization stages of LLM RL can independently cause training collapse.

  5. Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search

    cs.CV 2026-05 unverdicted novelty 6.0

    Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...

  6. FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration

    cs.LG 2026-05 unverdicted novelty 6.0

    FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...

  7. Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...

  8. T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    T²PO improves stability and performance in multi-turn agentic RL by using uncertainty dynamics at token and turn levels to guide exploration and avoid wasted rollouts.

  9. Co-Evolving Policy Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...

  10. DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training

    cs.LG 2026-04 unverdicted novelty 6.0

    DORA's multi-version streaming rollout enables 2-3x higher throughput in asynchronous RL for LLMs while preserving convergence by maintaining policy consistency, data integrity, and bounded staleness.

  11. AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving

    cs.AR 2026-04 unverdicted novelty 6.0

    AMMA is a memory-centric multi-chiplet architecture using HBM-PNM cubes, custom logic dies, hybrid parallelism, and reordered collectives that delivers 15.5X lower attention latency and 6.9X lower energy than NVIDIA H...

  12. JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training

    cs.LG 2026-04 unverdicted novelty 6.0

    JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.

  13. Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale

    cs.CL 2026-04 unverdicted novelty 6.0

    Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.

  14. TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training

    cs.DC 2026-04 unverdicted novelty 6.0

    TensorHub uses Reference-Oriented Storage to enable scalable weight transfer in LLM RL training by referencing replicated GPU weights, achieving up to 19x reduction in cross-datacenter stall time.

  15. OpenClaw-RL: Train Any Agent Simply by Talking

    cs.CL 2026-03 unverdicted novelty 6.0

    OpenClaw-RL recovers evaluative and directive signals from next-state interactions to enable online RL training of agents across terminal, GUI, SWE, and tool environments via a server-client architecture and hybrid objective.

  16. WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces

    cs.AI 2026-03 unverdicted novelty 6.0

    WebChain supplies the largest open dataset of real human web trajectories with triple-modal alignment and a dual mid-training method that separates grounding from planning to improve web agents.

  17. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

  18. Position: Agentic AI System Is a Foreseeable Pathway to AGI

    cs.AI 2026-05 unverdicted novelty 4.0

    Agentic AI systems with DAG topologies are claimed to deliver exponentially superior generalization and sample efficiency compared to monolithic scaling for achieving AGI.

  19. Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

    cs.AI 2026-05 unverdicted novelty 4.0

    Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.

  20. StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

    cs.CL 2026-04 unverdicted novelty 4.0

    StepPO argues that LLM agents should optimize at the step level rather than token level to better handle delayed rewards and long contexts in agentic RL.

  21. Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

    cs.AI 2026-05 unverdicted novelty 3.0

    Safactory combines parallel simulation, trustworthy data management, and asynchronous evolution platforms into a single pipeline claimed to be the first unified framework for trustworthy autonomous agents.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 20 Pith papers · 10 internal anchors

  1. [2]

    Dota 2 with Large Scale Deep Reinforcement Learning

    C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, R. Józefowicz, S. Gray, C. Olsson, J. Pachocki, M. Petrov, H. P. de Oliveira Pinto, J. Raiman, T. Salimans, J. Schlatter, J. Schneider, S. Sidor, I. Sutskever, J. Tang, F. Wolski, and S. Zhang. Dota 2 with large scale deep reinforcement learni...

  2. [3]

    Evaluating Large Language Models Trained on Code

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Her...

  3. [4]

    Sequoia: Scalable and Robust Speculative Decoding

    Z. Chen, A. May, R. Svirschevski, Y. Huang, M. Ryabinin, Z. Jia, and B. Chen. Sequoia: Scalable and robust speculative decoding. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 129531–129563. Curran Associates, Inc., 2024. URL https://pro...

  4. [5]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021. URL https://arxiv.org/abs/2110.14168

  5. [7]

    IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures

    L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu. IMPALA: scalable distributed deep-rl with importance weighted actor-learner architectures. In J. G. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockhol...

  6. [8]

    SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference

    L. Espeholt, R. Marinier, P. Stanczyk, K. Wang, and M. Michalski. SEED RL: scalable and efficient deep-rl with accelerated central inference. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=rkgvXlrKwH

  7. [9]

    Measuring Mathematical Problem Solving with the MATH Dataset

    D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. In J. Vanschoren and S. Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021. ...

  8. [10]

    Batch Size-Invariance for Policy Optimization

    J. Hilton, K. Cobbe, and J. Schulman. Batch size-invariance for policy optimization. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022....

  9. [12]

    Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

    J. Hu, Y. Zhang, Q. Han, D. Jiang, X. Zhang, and H. Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. CoRR, abs/2503.24290,

  10. [15]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    N. Jain, K. Han, A. Gu, W. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview.net/foru...

  11. [16]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan. Swe-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net,

  12. [17]

    URL https://openreview.net/forum?id=VTF8yNQM66

  13. [18]

    Recurrent Experience Replay in Distributed Reinforcement Learning

    S. Kapturowski, G. Ostrovski, J. Quan, R. Munos, and W. Dabney. Recurrent experience replay in distributed reinforcement learning. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=r1lyTjAqYX

  14. [20]

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In J. Flinn, M. I. Seltzer, P. Druschel, A. Kaufmann, and J. Mace, editors, Proceedings of the 29th Symposium on Operating Systems Principles, SOSP 2023, Koblenz, Germany, October ...

  15. [21]

    PUZZLE: Efficiently Aligning Large Language Models Through Light-Weight Context Switch

    K. Lei, Y. Jin, M. Zhai, K. Huang, H. Ye, and J. Zhai. PUZZLE: efficiently aligning large language models through light-weight context switch. In S. Bagchi and Y. Zhang, editors, Proceedings of the 2024 USENIX Annual Technical Conference, USENIX ATC 2024, Santa Clara, CA, USA, July 10-12, 2024, pages 127–140. USENIX Association, 2024. URL https://www.u...

  16. [22]

    NuminaMath

    J. Li, E. Beeching, L. Tunstall, B. Lipkin, R. Soletskyi, S. C. Huang, K. Rasul, L. Yu, A. Jiang, Z. Shen, Z. Qin, B. Dong, L. Zhou, Y. Fleureau, G. Lample, and S. Polu. NuminaMath. https://huggingface.co/AI-MO/NuminaMath-CoT, https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf, 2024

  17. [23]

    RLlib: Abstractions for Distributed Reinforcement Learning

    E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, K. Goldberg, J. Gonzalez, M. I. Jordan, and I. Stoica. RLlib: Abstractions for distributed reinforcement learning. In J. G. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proce...

  18. [24]

    ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning

    B. Y. Lin, R. L. Bras, K. Richardson, A. Sabharwal, R. Poovendran, P. Clark, and Y. Choi. Zebralogic: On the scaling limits of llms for logical reasoning, 2025. URL https://arxiv.org/abs/2502.01100

  19. [25]

    AgentBench: Evaluating LLMs as Agents

    X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, S. Zhang, X. Deng, A. Zeng, Z. Du, C. Zhang, S. Shen, T. Zhang, Y. Su, H. Sun, M. Huang, Y. Dong, and J. Tang. Agentbench: Evaluating llms as agents. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview...

  20. [26]

    DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level

    M. Luo, S. Tan, R. Huang, A. Patel, A. Ariyak, Q. Wu, X. Shi, R. Xin, C. Cai, M. Weber, et al. Deepcoder: A fully open-source 14b coder at o3-mini level, 2025

  21. [27]

    DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL

    M. Luo, S. Tan, J. Wong, X. Shi, W. Y. Tang, M. Roongta, C. Cai, J. Luo, L. E. Li, R. A. Popa, and I. Stoica. DeepScaleR: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2,

  22. [28]

    SRL: Scaling Distributed Reinforcement Learning to Over Ten Thousand Cores

    Z. Mei, W. Fu, J. Gao, G. Wang, H. Zhang, and Y. Wu. SRL: scaling distributed reinforcement learning to over ten thousand cores. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=lajn1iROCu

  23. [30]

    SpecInfer: Accelerating Large Language Model Serving with Tree-Based Speculative Inference and Verification

    X. Miao, G. Oliaro, Z. Zhang, X. Cheng, Z. Wang, Z. Zhang, R. Y. Y. Wong, A. Zhu, L. Yang, X. Shi, C. Shi, Z. Chen, D. Arfeen, R. Abhyankar, and Z. Jia. Specinfer: Accelerating large language model serving with tree-based speculative inference and verification. In Proceedings of the 29th ACM International Conference on Architectural Support for Programmi...

  24. [33]

    Learning to Reason with LLMs

    OpenAI, Sep 2024. URL https://openai.com/index/learning-to-reason-with-llms/

  25. [34]

    Introducing OpenAI o3 and o4-mini

    OpenAI, Apr 2025. URL https://openai.com/index/introducing-o3-and-o4-mini/

  26. [35]

    Training Language Models to Follow Instructions with Human Feedback

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Ch...

  27. [36]

    TinyZero

    J. Pan, J. Zhang, X. Wang, L. Yuan, H. Peng, and A. Suhr. TinyZero. https://github.com/Jiayi-Pan/TinyZero, 2025. Accessed: 2025-01-24

  28. [37]

    PyTorch: An Imperative Style, High-Performance Deep Learning Library

    A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Z. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. M. Wallach, H. Larochelle, A. Beygelz...

  29. [39]

    Markov Decision Processes: Discrete Stochastic Dynamic Programming

    M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley, 1994. ISBN 978-0-47161977-2. doi: 10.1002/9780470316887. URL https://doi.org/10.1002/9780470316887

  30. [40]

    ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

    S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. Zero: memory optimizations toward training trillion parameter models. In C. Cuicchi, I. Qualters, and W. T. Kramer, editors, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020, Virtual Event / Atlanta, Georgia, USA, November 9-19, 2020, page...

  31. [43]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

    J. Schulman, P. Moritz, S. Levine, M. I. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. In Y. Bengio and Y. LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1506.02438

  32. [44]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017. URL http://arxiv.org/abs/1707.06347

  33. [45]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. K. Li, Y. Wu, and D. Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. CoRR, abs/2402.03300,

  34. [48]

    HybridFlow: A Flexible and Efficient RLHF Framework

    G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu. HybridFlow: A flexible and efficient RLHF framework. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys 2025, Rotterdam, The Netherlands, 30 March 2025 - 3 April 2025, pages 1279–1297. ACM, 2025. doi: 10.1145/3689031.3696075. URL https://doi.or...

  35. [50]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    URL http://arxiv.org/abs/1909.08053

  36. [51]

    Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Parameters for Reasoning

    C. V. Snell, J. Lee, K. Xu, and A. Kumar. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview.net/forum?id=4FWAwZtd2n

  37. [53]

    INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning

    P. I. Team, S. Jaghouar, J. Mattern, J. M. Ong, J. Straube, M. Basra, A. Pazdera, K. Thaman, M. D. Ferrante, F. Gabriel, F. Obeid, K. Erdem, M. Keiblinger, and J. Hagemann. Intellect-2: A reasoning model trained through globally decentralized reinforcement learning, 2025. URL https://arxiv.org/abs/2505.07291

  38. [54]

    Attention Is All You Need

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Syste...

  39. [55]

    Grandmaster Level in StarCraft II Using Multi-Agent Reinforcement Learning

    O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V. Dalibard, D. Budden, Y. Sulsky, J. Molloy, T. L. Paine, Ç. Gülçehre, Z. Wang, T. Pfaff,...

  40. [56]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems...

  41. [57]

    DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data

    H. Xin, D. Guo, Z. Shao, Z. Ren, Q. Zhu, B. Liu, C. Ruan, W. Li, and X. Liang. Deepseek-prover: Advancing theorem proving in llms through large-scale synthetic data. CoRR, abs/2405.14333,

  42. [64]

    SLURM: Simple Linux Utility for Resource Management

    A. B. Yoo, M. A. Jette, and M. Grondona. SLURM: simple linux utility for resource management. In D. G. Feitelson, L. Rudolph, and U. Schwiegelshohn, editors, Job Scheduling Strategies for Parallel Processing, 9th International Workshop, JSSPP 2003, Seattle, WA, USA, June 24, 2003, Revised Papers, volume 2862 of Lecture Notes in Computer Science, pages 44–60... doi: 10.1007/10968987_3. URL https://doi.org/10.1007/10968987_3

  43. [66]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, T. Fan, G. Liu, L. Liu, X. Liu, H. Lin, Z. Lin, B. Ma, G. Sheng, Y. Tong, C. Zhang, M. Zhang, W. Zhang, H. Zhu, J. Zhu, J. Chen, J. Chen, C. Wang, H. Yu, W. Dai, Y. Song, X. Wei, H. Zhou, J. Liu, W. Ma, Y. Zhang, L. Yan, M. Qiao, Y. Wu, and M. Wang. DAPO: an open-source LLM reinforcement learning sys...

  44. [67]

    VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

    Y. Yue, Y. Yuan, Q. Yu, X. Zuo, R. Zhu, W. Xu, J. Chen, C. Wang, T. Fan, Z. Du, X. Wei, X. Yu, G. Liu, J. Liu, L. Liu, H. Lin, Z. Lin, B. Ma, C. Zhang, M. Zhang, W. Zhang, H. Zhu, R. Zhang, X. Liu, M. Wang, Y. Wu, and L. Yan. Vapo: Efficient and reliable reinforcement learning for advanced reasoning tasks, 2025. URL https://arxiv.org/abs/2504.05118

  45. [68]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Y. Zhao, A. Gu, R. Varma, L. Luo, C. Huang, M. Xu, L. Wright, H. Shojanazeri, M. Ott, S. Shleifer, A. Desmaison, C. Balioglu, P. Damania, B. Nguyen, G. Chauhan, Y. Hao, A. Mathews, and S. Li. PyTorch FSDP: experiences on scaling fully sharded data parallel. Proc. VLDB Endow., 16(12):3848–3860, 2023. doi: 10.14778/3611540.3611569. URL https://www.vldb.o...

  46. [69]

    SGLang: Efficient Execution of Structured Language Model Programs

    L. Zheng, L. Yin, Z. Xie, C. Sun, J. Huang, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez, C. W. Barrett, and Y. Sheng. SGLang: Efficient execution of structured language model programs. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems 38: Annua...

  47. [70]

    StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation

    Y. Zhong, Z. Zhang, X. Song, H. Hu, C. Jin, B. Wu, N. Chen, Y. Chen, Y. Zhou, C. Wan, H. Zhou, Y. Jiang, Y. Zhu, and D. Jiang. StreamRL: Scalable, heterogeneous, and elastic rl for llms with disaggregated stream generation, 2025. URL https://arxiv.org/abs/2504.15930
