pith. machine review for the scientific record.

arxiv: 2604.23838 · v1 · submitted 2026-04-26 · 💻 cs.LG

Recognition: unknown

JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 06:32 UTC · model grok-4.3

classification 💻 cs.LG
keywords JigsawRL · pipeline multiplexing · RL for LLMs · sub-stage graph · dynamic resource allocation · look-ahead scheduling · LLM post-training · throughput optimization

The pith

JigsawRL assembles efficient RL pipelines for LLM post-training by decomposing stages into sub-graphs and applying dynamic multiplexing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

JigsawRL introduces a framework for cost-efficient reinforcement learning pipelines in large language model post-training. It explores pipeline multiplexing as a new dimension of parallelism by breaking each pipeline into a Sub-Stage Graph that exposes intra-stage and inter-worker imbalances hidden in conventional stage-level designs. The system resolves multiplexing interference via dynamic resource allocation, reduces fragmented GPU utilization by migrating long-tail rollouts, and coordinates the whole process as a graph scheduling problem solved by a look-ahead heuristic. Experiments on 4-64 H100/A100 GPUs across different agentic RL pipelines and models report throughput gains of up to 1.85x over Verl for synchronous RL and 1.54x over StreamRL and AReaL for asynchronous RL, while supporting heterogeneous pipelines at moderate latency cost. This matters because higher utilization on existing clusters could accelerate RL-based LLM post-training without a proportional increase in compute.

Core claim

JigsawRL decomposes each RL pipeline into a Sub-Stage Graph to expose hidden imbalances, resolves multiplexing interference through dynamic resource allocation, eliminates fragmented utilization by migrating long-tail rollouts across workers, and formulates their coordination as a graph scheduling problem solved with a look-ahead heuristic. The result is up to 1.85x throughput over Verl on synchronous RL and 1.54x over StreamRL and AReaL on asynchronous RL across 4-64 H100/A100 GPUs.

What carries the argument

The Sub-Stage Graph abstraction that breaks down pipeline stages to expose imbalances, paired with dynamic resource allocation, long-tail rollout migration, and look-ahead heuristic scheduling to manage pipeline multiplexing.
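The paper does not publish its scheduler, but the idea can be sketched. The toy code below builds a sub-stage DAG and list-schedules it by remaining critical path; the graph, the names, and the use of critical-path length as the "look-ahead" priority are our illustrative assumptions, not JigsawRL's actual algorithm:

```python
import heapq

# Hypothetical sub-stage DAG: name -> (duration, dependencies). One rollout
# sub-stage (gen_b) is a long-tail straggler feeding a shared training stage.
GRAPH = {
    "gen_a": (4.0, []),
    "gen_b": (9.0, []),        # long-tail rollout
    "env_a": (1.0, ["gen_a"]),
    "env_b": (1.0, ["gen_b"]),
    "train": (5.0, ["env_a", "env_b"]),
}

def critical_path(graph):
    """Longest path from each node to the end of the pipeline; this remaining
    work is the 'look-ahead' priority used by the scheduler below."""
    children = {n: [] for n in graph}
    for n, (_, deps) in graph.items():
        for d in deps:
            children[d].append(n)
    memo = {}
    def rank(n):
        if n not in memo:
            memo[n] = graph[n][0] + max((rank(c) for c in children[n]), default=0.0)
        return memo[n]
    return {n: rank(n) for n in graph}

def schedule(graph, workers=2):
    """List-schedule ready sub-stages onto workers, longest critical path first.
    Returns the makespan of the resulting schedule."""
    prio = critical_path(graph)
    remaining = {n: len(deps) for n, (_, deps) in graph.items()}
    children = {n: [] for n in graph}
    for n, (_, deps) in graph.items():
        for d in deps:
            children[d].append(n)
    ready = sorted((n for n in graph if remaining[n] == 0), key=lambda x: -prio[x])
    running = []                       # min-heap of (finish_time, node)
    free, clock = workers, 0.0
    while ready or running:
        while ready and free:          # start as many ready sub-stages as fit
            n = ready.pop(0)
            heapq.heappush(running, (clock + graph[n][0], n))
            free -= 1
        clock, done = heapq.heappop(running)
        free += 1
        for c in children[done]:       # unlock dependents
            remaining[c] -= 1
            if remaining[c] == 0:
                ready.append(c)
        ready.sort(key=lambda x: -prio[x])
    return clock
```

With two workers this toy graph finishes in 15.0 time units, matching the lower bound set by the gen_b → env_b → train chain.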

If this is right

  • JigsawRL achieves up to 1.85x higher throughput than Verl in synchronous RL pipelines.
  • It delivers up to 1.54x throughput gains over StreamRL and AReaL in asynchronous RL.
  • The framework supports heterogeneous RL pipelines with only moderate increases in latency.
  • Improved GPU utilization reduces the hardware requirements for large-scale LLM post-training.
  • Pipeline multiplexing becomes a viable new dimension for parallelism in RL systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could be extended to other distributed ML training workflows with imbalanced stages.
  • Better sub-stage visibility might inspire similar fine-grained scheduling in non-RL LLM inference serving.
  • Testing on even larger GPU counts or more diverse models could reveal scalability limits of the heuristic.
  • The approach highlights the potential for adaptive scheduling to handle variability in rollout lengths.
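The long-tail migration idea in particular can be made concrete. Below is a minimal sketch under one assumption we add: a rollout migrates only if the receiving worker would still finish before the sending worker would have, so each move can only shrink the makespan. The function name and threshold rule are hypothetical, and a real system would also have to price in KV-cache transfer:

```python
# Hedged sketch of long-tail rollout migration; the migration rule here is
# an illustrative assumption, not JigsawRL's actual policy.
def migrate_long_tail(queues):
    """queues[i] = remaining decode lengths (tokens) of rollouts on worker i.

    Repeatedly move the longest pending rollout from the most-loaded worker
    to the least-loaded one, while the receiver still finishes strictly
    before the sender would have (KV-cache transfer cost is ignored).
    """
    moved = []
    while True:
        loads = [sum(q) for q in queues]
        src = max(range(len(queues)), key=loads.__getitem__)
        dst = min(range(len(queues)), key=loads.__getitem__)
        if not queues[src]:
            break
        longest = max(queues[src])
        if loads[dst] + longest >= loads[src]:
            break  # no strictly improving move remains
        queues[src].remove(longest)
        queues[dst].append(longest)
        moved.append(longest)
    return moved
```

On queues [[50, 40, 30], [5]] this moves the 50-token rollout, cutting the worst-case per-worker load from 120 to 70.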

Load-bearing premise

The Sub-Stage Graph accurately captures all relevant imbalances, and dynamic allocation plus look-ahead scheduling can resolve interference and long-tail issues with only moderate overhead and without introducing instability.

What would settle it

Running the system on pipelines with extreme rollout-length variation, or scaling to hundreds of GPUs, and checking whether the reported speedups hold or whether overheads and instability appear.
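Part of that check can be rehearsed in simulation before touching hardware. The sketch below samples heavy-tailed (lognormal) rollout lengths and compares a static round-robin placement against longest-processing-time dispatch as a crude stand-in for migration; the distribution parameters and the LPT stand-in are our assumptions, not the paper's setup:

```python
import random

def makespan_static(lengths, workers):
    """Round-robin rollouts to workers up front; no migration afterwards."""
    loads = [0.0] * workers
    for i, n in enumerate(lengths):
        loads[i % workers] += n
    return max(loads)

def makespan_dynamic(lengths, workers):
    """Longest-processing-time dispatch: each rollout goes to the worker that
    frees up first. A stand-in for migration; provably within a factor 4/3
    of the optimal makespan."""
    loads = [0.0] * workers
    for n in sorted(lengths, reverse=True):
        loads[loads.index(min(loads))] += n
    return max(loads)

random.seed(0)
lengths = [random.lognormvariate(6, 1.5) for _ in range(256)]  # heavy tail
s, d = makespan_static(lengths, 8), makespan_dynamic(lengths, 8)
print(f"static {s:.0f} vs dynamic {d:.0f} tokens ({s / d:.2f}x)")
```

Sweeping the lognormal sigma upward is a cheap way to probe where the moderate-overhead claim might stop holding.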

Figures

Figures reproduced from arXiv: 2604.23838 by Chang Chen, Hehua Ouyang, Steven Swanson, Yue Guan, Yufei Ding, Zaifeng Pan, Zhengding Hu, Zhen Wang, Zhongkai Yu.

Figure 1: (a) Trade-off between pipeline throughput and mon…
Figure 2: Comparison of execution behaviors under existing RL frameworks (a-d) and JigsawRL’s sub-stage-level spatial …
Figure 3: Different agentic behaviors in rollout stages and the …
Figure 4: Comparison across different multiplexing methods.
Figure 5: Overview of JigsawRL. Starting from coarse-grained pipeline graphs, JigsawRL constructs fine-grained sub-stage …
Figure 6: Decoding batch size variation across 10 adjacent …
Figure 8: Slowdown under different SM partitioning when …
Figure 11: Comparison between greedy and critical-path …
Figure 10: JigsawRL mitigates inter-worker imbalance by …
Figure 12: Throughput of homogeneous pipelines across different agentic RL pipelines on 8 H100 GPUs.
Figure 13: Throughput comparison between Verl and JigsawRL scaling from 8 to 64 A100 GPUs with larger models.
Figure 14: Latency increase of different multiplexing methods …
Figure 16: Throughput of heterogeneous pipelines in Verl and …
Figure 17: Throughput of JigsawRL multiplexing syn…
Figure 18: Throughput-Cost trade-off for Qwen3-4B scaling …
Figure 19: Pipeline multiplexing strategies and their impact on …
read the original abstract

We present JigsawRL, a cost-efficient framework that explores Pipeline Multiplexing as a new dimension of RL parallelism. JigsawRL decomposes each pipeline into a Sub-Stage Graph that exposes the intra-stage and inter-worker imbalance hidden by stage-level systems. On this abstraction, JigsawRL resolves multiplexing interference through dynamic resource allocation, eliminates fragmented utilization by migrating long-tail rollouts across workers, and formulates their coordination as a graph scheduling problem solved with a look-ahead heuristic. On 4-64 H100/A100 GPUs across different agentic RL pipelines and models, JigsawRL achieves up to 1.85x throughput over Verl on synchronous RL, 1.54x over StreamRL and AReaL on asynchronous RL, and supports heterogeneous pipelines with moderate latency trade-off.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces JigsawRL, a framework for efficient LLM post-training via reinforcement learning that treats pipeline multiplexing as a new parallelism dimension. It decomposes RL pipelines into a Sub-Stage Graph to reveal intra-stage and inter-worker imbalances, applies dynamic resource allocation to resolve multiplexing interference, migrates long-tail rollouts to reduce fragmentation, and formulates the resulting coordination as a graph scheduling problem solved with a look-ahead heuristic. Empirical results on 4-64 H100/A100 GPUs across agentic RL pipelines and models report up to 1.85x throughput gains versus Verl (synchronous) and 1.54x versus StreamRL/AReaL (asynchronous), plus support for heterogeneous pipelines at moderate latency cost.

Significance. If the throughput claims hold under rigorous controls, JigsawRL would represent a practical advance in systems for RL-based LLM post-training by improving GPU utilization through fine-grained multiplexing rather than stage-level parallelism. The empirical evaluation across multiple pipelines and scales is a strength, as is the focus on real hardware (H100/A100) and heterogeneous setups. However, the absence of worst-case analysis for the scheduling heuristic limits the result's generality beyond the tested workloads.

major comments (3)
  1. [Abstract / Experimental results] Abstract and experimental evaluation (throughput claims): the headline speedups (1.85x vs Verl, 1.54x vs StreamRL/AReaL) are reported without any information on run-to-run variance, statistical tests, baseline re-implementations, or controls for confounding factors such as model size, rollout length distribution, or batching strategy. This is load-bearing because the central contribution is an empirical systems improvement whose validity rests on these numbers.
  2. [Sub-Stage Graph and scheduling formulation] Section on Sub-Stage Graph and look-ahead heuristic: no worst-case analysis, convergence bound, or overhead characterization is provided for the dynamic allocation and migration decisions. The claim that these resolve multiplexing interference and long-tail effects with only moderate latency trade-off therefore rests entirely on the specific agentic pipelines tested; heavier-tailed rollout lengths could invalidate the moderate-overhead assertion.
  3. [Heterogeneous pipeline experiments] Heterogeneous pipeline support: the paper asserts that JigsawRL handles heterogeneous pipelines with moderate latency trade-off, yet no quantitative breakdown (e.g., per-pipeline latency histograms or migration frequency) is supplied to substantiate the trade-off magnitude or stability across differing pipeline compositions.
minor comments (2)
  1. [Sub-Stage Graph definition] Notation for the Sub-Stage Graph (nodes, edges, and resource attributes) should be defined more explicitly with a small example diagram or table to aid readability.
  2. [Conclusion / Experiments] The manuscript would benefit from a reproducibility statement indicating whether code, configuration files, or exact workload traces will be released.
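For major comment 1, the requested error bars amount to a small amount of bookkeeping over repeated runs. A minimal sketch, using hypothetical throughput samples that are not numbers from the paper:

```python
import statistics

def report_speedup(baseline_runs, system_runs):
    """Mean speedup plus run-to-run spread across repeated measurements,
    the kind of error bar the referee asks for (>= 3 independent runs)."""
    base = statistics.mean(baseline_runs)
    ratios = [r / base for r in system_runs]
    return statistics.mean(ratios), statistics.stdev(ratios)

# Hypothetical throughput samples (samples/s); NOT numbers from the paper.
verl = [41.0, 42.5, 40.8]
jigsaw = [75.2, 77.9, 74.1]
mean_x, std_x = report_speedup(verl, jigsaw)
print(f"speedup {mean_x:.2f}x +/- {std_x:.2f}")
```

Reporting the spread of ratios (rather than ratios of means alone) makes it visible when a headline speedup sits inside run-to-run noise.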

Simulated Author's Rebuttal

3 responses · 1 unresolved

We are grateful to the referee for their thorough review and valuable suggestions. We have carefully considered each major comment and provide point-by-point responses below, along with planned revisions to the manuscript.

read point-by-point responses
  1. Referee: [Abstract / Experimental results] Abstract and experimental evaluation (throughput claims): the headline speedups (1.85x vs Verl, 1.54x vs StreamRL/AReaL) are reported without any information on run-to-run variance, statistical tests, baseline re-implementations, or controls for confounding factors such as model size, rollout length distribution, or batching strategy. This is load-bearing because the central contribution is an empirical systems improvement whose validity rests on these numbers.

    Authors: We thank the referee for highlighting this important aspect of empirical validation. While the experimental setup in Section 5 describes the model sizes, rollout length distributions, and batching strategies employed, we acknowledge the lack of reported run-to-run variance and statistical tests. In the revised manuscript, we will include error bars representing standard deviations from at least three independent runs for each throughput measurement and add a note on the baseline re-implementations to improve transparency and address potential confounding factors. revision: yes

  2. Referee: [Sub-Stage Graph and scheduling formulation] Section on Sub-Stage Graph and look-ahead heuristic: no worst-case analysis, convergence bound, or overhead characterization is provided for the dynamic allocation and migration decisions. The claim that these resolve multiplexing interference and long-tail effects with only moderate latency trade-off therefore rests entirely on the specific agentic pipelines tested; heavier-tailed rollout lengths could invalidate the moderate-overhead assertion.

    Authors: We concur that a formal worst-case analysis or convergence bound for the look-ahead heuristic is not included in the current manuscript. Developing such bounds for a practical heuristic operating under dynamic conditions with variable rollout lengths is non-trivial and may not yield tight results. We will revise the paper to include a characterization of the scheduling and migration overheads based on our measurements, along with a discussion of the heuristic's behavior and potential limitations under heavier-tailed distributions. This will better contextualize the empirical results. revision: partial

  3. Referee: [Heterogeneous pipeline experiments] Heterogeneous pipeline support: the paper asserts that JigsawRL handles heterogeneous pipelines with moderate latency trade-off, yet no quantitative breakdown (e.g., per-pipeline latency histograms or migration frequency) is supplied to substantiate the trade-off magnitude or stability across differing pipeline compositions.

    Authors: We will update the heterogeneous pipeline experiments to provide the requested quantitative details. Specifically, we plan to add latency histograms for individual pipelines and report migration frequencies along with their contribution to the observed latency trade-off. These additions will offer a more rigorous substantiation of the moderate overhead in heterogeneous settings. revision: yes

standing simulated objections not resolved
  • Formal worst-case analysis or convergence bounds for the dynamic scheduling heuristic

Circularity Check

0 steps flagged

No circularity: empirical systems framework with direct benchmark measurements

full rationale

The paper introduces JigsawRL as a practical framework for decomposing RL pipelines into a Sub-Stage Graph and applying dynamic allocation plus look-ahead scheduling. All performance claims (throughput gains versus Verl, StreamRL, and AReaL) are presented as direct empirical results measured on 4-64 H100/A100 GPUs across concrete agentic workloads and models. No equations, derivations, fitted parameters, or first-principles predictions appear; the central claims rest on runtime observations rather than on quantities that reduce to the inputs by construction. No self-citation chains or imported uniqueness theorems are load-bearing. The work is therefore evaluated directly against external baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not mention any free parameters, axioms, or invented entities. The contribution appears to rest on standard assumptions from distributed systems and RL infrastructure that are not detailed here.

pith-pipeline@v0.9.0 · 5457 in / 1184 out tokens · 55289 ms · 2026-05-08T06:32:38.197491+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration

    cs.LG 2026-05 unverdicted novelty 6.0

    FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...

Reference graph

Works this paper leans on

71 extracted references · 43 canonical work pages · cited by 1 Pith paper · 20 internal anchors

  1. [1]

    https://github.com/NVIDIA-NeMo/RL, 2025

    Nemo rl: A scalable and efficient post-training li- brary. https://github.com/NVIDIA-NeMo/RL, 2025. GitHub repository

  2. [2]

    https://www.aicerts.ai/news/alibaba-qwen- model-downloads-metrics-and-enterprise-impact/, 2026

    Alibaba qwen model downloads: Metrics and enterprise impact. https://www.aicerts.ai/news/alibaba-qwen- model-downloads-metrics-and-enterprise-impact/, 2026

  3. [3]

    Lmrl gym: Benchmarks for multi-turn reinforcement learning with language models, 2023

    Marwa Abdulhai, Isadora White, Charlie Snell, Charles Sun, Joey Hong, Yuexiang Zhai, Kelvin Xu, and Sergey Levine. Lmrl gym: Benchmarks for multi-turn reinforce- ment learning with language models.arXiv preprint arXiv:2311.18232, 2023

  4. [4]

    Amazon EC2 On-Demand Pricing

    Amazon Web Services. Amazon EC2 On-Demand Pricing. https://aws.amazon.com/ec2/pricing/ on-demand/. Accessed: 2026-03-21

  5. [5]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforce- ment learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

  6. [6]

    MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

    Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, An- drew McNamara, Bhaskar Mitra, Tri Nguyen, et al. Ms marco: A human generated machine reading comprehen- sion dataset.arXiv preprint arXiv:1611.09268, 2016

  7. [7]

    Respec: Towards optimizing speculative decoding in reinforcement learning systems

    Qiaoling Chen, Zijun Liu, Peng Sun, Shenggui Li, Guoteng Wang, Ziming Liu, Yonggang Wen, Siyuan Feng, and Tianwei Zhang. Respec: Towards optimizing speculative decoding in reinforcement learning systems. arXiv preprint arXiv:2510.26475, 2025

  8. [8]

    Multi-Agent Evolve:

    Yixing Chen, Yiding Wang, Siqi Zhu, Haofei Yu, Tao Feng, Muhan Zhang, Mostofa Patwary, and Jiaxuan You. Multi-agent evolve: Llm self-improve through co-evolution.arXiv preprint arXiv:2510.23595, 2025

  9. [9]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plap- pert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  10. [10]

    The faiss library.IEEE Transactions on Big Data, 2025

    Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The faiss library.IEEE Transactions on Big Data, 2025

  11. [11]

    Muxserve: flexible spatial-temporal multi- plexing for multiple llm serving.arXiv preprint arXiv:2404.02015, 2024

    Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, and Hao Zhang. Muxserve: flexible spatial-temporal multi- plexing for multiple llm serving.arXiv preprint arXiv:2404.02015, 2024

  12. [12]

    ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

    Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. Retool: Reinforcement learn- ing for strategic tool use in llms.arXiv preprint arXiv:2504.11536, 2025

  13. [13]

    Areal: A large-scale asynchronous reinforcement learning system for language reasoning.arXiv preprint arXiv:2505.24298,

    Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Ji- ashu Wang, et al. Areal: A large-scale asynchronous reinforcement learning system for language reasoning. arXiv preprint arXiv:2505.24298, 2025

  14. [14]

    Rollpacker: Mitigating long-tail rollouts for fast, synchronous rl post-training.arXiv preprint arXiv:2509.21009,

    Wei Gao, Yuheng Zhao, Dakai An, Tianyuan Wu, Lunxi Cao, Shaopan Xiong, Ju Huang, Weixun Wang, Siran Yang, Wenbo Su, et al. Rollpacker: Mitigating long-tail rollouts for fast, synchronous rl post-training.arXiv preprint arXiv:2509.21009, 2025

  15. [15]

    Rollart: Scaling agentic rl training via disaggregated infrastructure.arXiv preprint arXiv:2512.22560, 2025

    Wei Gao, Yuheng Zhao, Tianyuan Wu, Shaopan Xiong, Weixun Wang, Dakai An, Lunxi Cao, Dilxat Muhtar, Zichen Liu, Haizhou Zhao, et al. Rollart: Scaling agen- tic rl training via disaggregated infrastructure.arXiv preprint arXiv:2512.22560, 2025

  16. [16]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shi- rong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  17. [17]

    History doesn’t repeat itself but rollouts rhyme: Accelerating reinforce- ment learning with rhymerl

    Jingkai He, Tianjian Li, Erhu Feng, Dong Du, Qian Liu, Tao Liu, Yubin Xia, and Haibo Chen. History doesn’t repeat itself but rollouts rhyme: Accelerating reinforce- ment learning with rhymerl. InProceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, V olume 2, pages 929–945, 2026

  18. [18]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob 13 Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021

  19. [19]

    OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

    Jian Hu, Xibin Wu, Zilin Zhu, Weixun Wang, Dehao Zhang, Yu Cao, et al. Openrlhf: An easy-to-use, scalable and high-performance rlhf framework.arXiv preprint arXiv:2405.11143, 6, 2024

  20. [20]

    Hedrarag: Co- optimizing generation and retrieval for heterogeneous rag workflows

    Zhengding Hu, Vibha Murthy, Zaifeng Pan, Wanlu Li, Xiaoyi Fang, Yufei Ding, and Yuke Wang. Hedrarag: Co- optimizing generation and retrieval for heterogeneous rag workflows. InProceedings of the ACM SIGOPS 31st symposium on operating systems principles, pages 623–638, 2025

  21. [21]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richard- son, Ahmed El-Kishky, Aiden Low, Alec Helyar, Alek- sander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  22. [22]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770, 2023

  23. [23]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Ser- can Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning.arXiv preprint arXiv:2503.09516, 2025

  24. [24]

    Efficient memory manage- ment for large language model serving with pagedatten- tion

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory manage- ment for large language model serving with pagedatten- tion. InProceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

  25. [25]

    In-the-flow agentic system optimization for effective planning and tool use

    Zhuofeng Li, Haoxiang Zhang, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou, and Pan Lu. In-the-flow agentic system optimization for effective planning and tool use. InInternational Conference on Learning Representations (ICLR), 2026

  26. [26]

    Spec-rl: Accelerating on-policy reinforcement learning via speculative rollouts.arXiv preprint arXiv:2509.23232, 2025a

    Bingshuai Liu, Ante Wang, Zijun Min, Liang Yao, Haibo Zhang, Yang Liu, Anxiang Zeng, and Jinsong Su. Spec- rl: Accelerating on-policy reinforcement learning via speculative rollouts.arXiv preprint arXiv:2509.23232, 2025

  27. [27]

    Conco: Optimizing compilation of concurrent tensor programs on shared gpu

    Jiamin Lu, Jingwei Sun, Yunlong Xu, Peng Sun, and Guangzhong Sun. Conco: Optimizing compilation of concurrent tensor programs on shared gpu. InProceed- ings of the 39th ACM International Conference on Su- percomputing, pages 640–653, 2025

  28. [28]

    Real: Efficient rlhf train- ing of large language models with parameter realloca- tion.Proceedings of Machine Learning and Systems, 7, 2025

    Zhiyu Mei, Wei Fu, Kaiwei Li, Guangju Wang, Huanchen Zhang, and Yi Wu. Real: Efficient rlhf train- ing of large language models with parameter realloca- tion.Proceedings of Machine Learning and Systems, 7, 2025

  29. [29]

    arXiv preprint arXiv:2410.18252 , year=

    Michael Noukhovitch, Shengyi Huang, Sophie Xhon- neux, Arian Hosseini, Rishabh Agarwal, and Aaron Courville. Asynchronous rlhf: Faster and more effi- cient off-policy rl for language models.arXiv preprint arXiv:2410.18252, 2024

  30. [30]

    Multi-process service

    NVIDIA. Multi-process service. https://docs. nvidia.com/deploy/mps/index.html, 2022

  31. [31]

    Nvidia mig (multi-instance gpu)

    NVIDIA. Nvidia mig (multi-instance gpu). https://www.nvidia.com/en-us/technologies/ multi-instance-gpu/, 2022

  32. [32]

    Accessed: 2025-03-29

    NVIDIA Corporation.CUDA Driver API: Green Con- texts, 2025. Accessed: 2025-03-29

  33. [33]

    Serverless rl

    OpenPipe. Serverless rl. https://openpipe.ai/ blog/serverless-rl, 2024. Accessed: 2026-03-04

  34. [34]

    Training language models to follow instructions with human feedback.Advances in neural information pro- cessing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information pro- cessing systems, 35:27730–27744, 2022

  35. [35]

    Gorilla: Large language model connected with massive apis.Advances in Neural Information Processing Systems, 37:126544–126565, 2024

    Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive apis.Advances in Neural Information Processing Systems, 37:126544–126565, 2024

  36. [36]

    Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning

    Ruoyu Qin, Weiran He, Weixiao Huang, Yangkun Zhang, Yikai Zhao, Bo Pang, Xinran Xu, Yingdi Shan, Yongwei Wu, and Mingxing Zhang. Seer: Online con- text learning for fast synchronous llm reinforcement learning.arXiv preprint arXiv:2511.14617, 2025

  37. [37]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis.arXiv preprint arXiv:2307.16789, 2023

  38. [38]

    Zero: Memory optimizations toward train- ing trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward train- ing trillion parameter models. InSC20: international conference for high performance computing, networking, storage and analysis, pages 1–16. IEEE, 2020

  39. [39]

    arXiv preprint arXiv:2602.19362 , year=

    Daniel Ritter, Owen Oertell, Bradley Guo, Jonathan Chang, Kianté Brantley, and Wen Sun. Llms can learn to reason via off-policy rl.arXiv preprint arXiv:2602.19362, 2026. 14

  40. [40]

    rstar2-agent: Agentic reasoning technical report, 2025

    Ning Shang, Yifei Liu, Yi Zhu, Li Lyna Zhang, Wei- jiang Xu, Xinyu Guan, Buze Zhang, Bingcheng Dong, Xudong Zhou, Bowen Zhang, et al. rstar2-agent: Agentic reasoning technical report.arXiv preprint arXiv:2508.20722, 2025

  41. [41]

    Multi-turn re- inforcement learning with preference human feedback

    Lior Shani, Aviv Rosenberg, Asaf Cassel, Oran Lang, Daniele Calandriello, Avital Zipori, Hila Noga, Orgad Keller, Bilal Piot, Idan Szpektor, et al. Multi-turn re- inforcement learning with preference human feedback. Advances in Neural Information Processing Systems, 37:118953–118993, 2024

  42. [42]

    Beat the long tail: Distribution-aware speculative decoding for rl training.arXiv preprint arXiv:2511.13841, 2025

    Zelei Shao, Vikranth Srivatsa, Sanjana Srivastava, Qingyang Wu, Alpay Ariyak, Xiaoxia Wu, Ameen Pa- tel, Jue Wang, Percy Liang, Tri Dao, Ce Zhang, Yiy- ing Zhang, Ben Athiwaratkun, Chenfeng Xu, and Junx- iong Wang. Beat the long tail: Distribution-aware speculative decoding for rl training.arXiv preprint arXiv:2511.13841, 2025

  43. [43]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  44. [44]

    arXiv preprint arXiv:2510.12633 , year=

    Guangming Sheng, Yuxuan Tong, Borui Wan, Wang Zhang, Chaobo Jia, Xibin Wu, Yuqi Wu, Xiang Li, Chi Zhang, Yanghua Peng, et al. Laminar: A scalable asyn- chronous rl post-training framework.arXiv preprint arXiv:2510.12633, 2025

  45. [45]

    Hybridflow: A flexible and efficient rlhf framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. InProceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025

  46. [46]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter lan- guage models using model parallelism.arXiv preprint arXiv:1909.08053, 2019

  47. [47]

    Orchestrrl: Dynamic compute and network orchestration for disaggregated rl.arXiv preprint arXiv:2601.01209, 2026

    Xin Tan, Yicheng Feng, Yu Zhou, Yimin Jiang, Yibo Zhu, and Hong Xu. Orchestrrl: Dynamic compute and network orchestration for disaggregated rl.arXiv preprint arXiv:2601.01209, 2026

  48. [48]

    Tinker documentation: Serverless api framework for modular rl

    Thinking Machines. Tinker documentation: Serverless api framework for modular rl. https://tinker-docs. thinkingmachines.ai/, 2024. Accessed: 2026-03- 04

  49. [49]

    Vagen:reinforcing world model reasoning for multi- turn vlm agents, 2025

    Kangrui Wang*, Pingyue Zhang*, Zihan Wang*, Yaning Gao*, Linjie Li*, Qineng Wang, Hanyang Chen, Chi Wan, Yiping Lu, Zhengyuan Yang, Lijuan Wang, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, Yejin Choi, and Manling Li. Vagen:reinforcing world model reasoning for multi- turn vlm agents, 2025

  50. [50]

    Rlhfspec: Breaking the effi- ciency bottleneck in rlhf training via adaptive drafting

    Siqi Wang, Hailong Yang, Junjie Zhu, Xuezhu Wang, Yufan Xu, and Depei Qian. Rlhfspec: Breaking the effi- ciency bottleneck in rlhf training via adaptive drafting. arXiv preprint arXiv:2512.04752, 2025

  51. [51]

    Reinforcement learning optimization for large- scale learning: An efficient and user-friendly scaling library.arXiv preprint arXiv:2506.06122, 2025

    Weixun Wang, Shaopan Xiong, Gengru Chen, Wei Gao, Sheng Guo, Yancheng He, Ju Huang, Jiaheng Liu, Zhendong Li, Xiaoyang Li, et al. Reinforcement learning optimization for large-scale learning: An effi- cient and user-friendly scaling library.arXiv preprint arXiv:2506.06122, 2025

  52. [52]

    {WLB-LLM}:{Workload-Balanced} 4d parallelism for large language model training

    Zheng Wang, Anna Cai, Xinfeng Xie, Zaifeng Pan, Yue Guan, Weiwei Chu, Jie Wang, Shikai Li, Jianyu Huang, Chris Cai, et al. {WLB-LLM}:{Workload-Balanced} 4d parallelism for large language model training. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), pages 785–801, 2025

[53] Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, et al. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning. arXiv preprint arXiv:2504.20073, 2025.

[54] Rui Wei, Hanfei Yu, Shubham Jain, Yogarajan Sivakumar, Devesh Tiwari, Jian Li, Seung-Jong Park, and Hao Wang. Rlhfless: Serverless computing for efficient rlhf. arXiv preprint arXiv:2602.22718, 2026.

[55] Tianyuan Wu, Lunxi Cao, Yining Wei, Wei Gao, Yuheng Zhao, Dakai An, Shaopan Xiong, Zhiqiang Lv, Ju Huang, Siran Yang, et al. Rollmux: Phase-level multiplexing for disaggregated rl post-training. arXiv preprint arXiv:2512.11306, 2025.

[56] Yongji Wu, Xueshen Liu, Haizhong Zheng, Juncheng Gu, Beidi Chen, Z Morley Mao, Arvind Krishnamurthy, and Ion Stoica. Rlboost: Harvesting preemptible resources for cost-efficient reinforcement learning on llms. arXiv preprint arXiv:2510.19225, 2025.

[57] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[58] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018.

[59] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022.

[60] Zhewei Yao, Reza Yazdani Aminabadi, Olatunji Ruwase, Samyam Rajbhandari, Xiaoxia Wu, Ammar Ahmad Awan, Jeff Rasley, Minjia Zhang, Conglong Li, Connor Holmes, et al. Deepspeed-chat: Easy, fast and affordable rlhf training of chatgpt-like models at all scales. arXiv preprint arXiv:2308.01320, 2023.

[61] Chenhao Ye, Huaizheng Zhang, Mingcong Han, Baoquan Zhong, Xiang Li, Qixiang Chen, Xinyi Zhang, Weidong Zhang, Kaihua Jiang, Wang Zhang, et al. Tensorhub: Scalable and elastic weight transfer for llm rl training. arXiv preprint arXiv:2604.09107, 2026.

[62] Chao Yu, Yuanqing Wang, Zhen Guo, Hao Lin, Si Xu, Hongzhi Zang, Quanlu Zhang, Yongji Wu, Chunyang Zhu, Junhao Hu, et al. Rlinf: Flexible and efficient large-scale reinforcement learning via macro-to-micro flow transformation. arXiv preprint arXiv:2509.15965, 2025.

[63] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, 2022.

[64] Shan Yu, Jiarong Xing, Yifan Qiao, Mingyuan Ma, Yangmin Li, Yang Wang, Shuo Yang, Zhiqiang Xie, Shiyi Cao, Ke Bao, et al. Prism: Unleashing gpu sharing for cost-efficient multi-llm serving. arXiv preprint arXiv:2505.04021, 2025.

[65] Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. PyTorch FSDP: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023.

[66] Haizhong Zheng, Jiawei Zhao, and Beidi Chen. Prosperity before collapse: How far can off-policy rl reach with stale data on llms? arXiv preprint arXiv:2510.01161, 2025.

[67] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems, 37:62557–62583, 2024.

[68] Yinmin Zhong, Zili Zhang, Xiaoniu Song, Hanpeng Hu, Chao Jin, Bingyang Wu, Nuo Chen, Yukun Chen, Yu Zhou, Changyi Wan, et al. Streamrl: Scalable, heterogeneous, and elastic rl for llms with disaggregated stream generation. arXiv preprint arXiv:2504.15930, 2025.

[69] Yinmin Zhong, Zili Zhang, Bingyang Wu, Shengyu Liu, Yukun Chen, Changyi Wan, Hanpeng Hu, Lei Xia, Ranchen Ming, Yibo Zhu, et al. Optimizing RLHF training for large language models with stage fusion. In 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25), pages 489–503, 2025.

[70] Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023.

[71] Zilin Zhu, Chengxing Xie, Xin Lv, and slime Contributors. slime: An llm post-training framework for rl scaling. https://github.com/THUDM/slime, 2025. GitHub repository. Corresponding author: Xin Lv.