pith. sign in

arxiv: 2606.11867 · v1 · pith:6V524WSHnew · submitted 2026-06-10 · 💻 cs.DC

Harnessing Routing Foresight for Micro-step-level MoE load balancing in RL Post-training

Pith reviewed 2026-06-27 08:33 UTC · model grok-4.3

classification 💻 cs.DC
keywords Mixture-of-ExpertsReinforcement Learning Post-trainingLoad BalancingExpert RoutingDistributed TrainingMicro-step SchedulingGPU Clusters
0
0 comments X

The pith

ForeMoE uses routing foresight from the rollout stage to balance MoE experts at micro-step level in RL post-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mixture-of-Experts models during reinforcement learning post-training face expert load imbalance because tiny batch sizes create severe high-frequency fluctuations at the micro-step level even when overall step load stays stable. Prior load-balancing systems that rely on historical statistics cannot keep up with these rapid changes. ForeMoE instead extracts routing decisions made in the rollout stage and uses them to plan expert placement and transfers for the recompute and policy update stages. A hierarchical planner decomposes the complex balancing task into smaller pieces while a transfer engine overlaps moves across CPU-assisted and GPU-direct paths to support frequent reconfigurations. Evaluations on 64 GPUs show this yields up to 1.45 times faster end-to-end training than existing RL post-training systems.

Core claim

ForeMoE is a micro-step-level load balancing system for MoE RL post-training that exploits foreseeable routing information from the rollout stage to proactively guide expert placement and transfers in the recompute and policy update stages. It uses a hierarchical planner to decompose the NP-hard balancing problem and a transfer engine that leverages complementary CPU-assisted and GPU-direct hardware paths for overlapped expert movement, enabling per-micro-step reconfiguration without dependence on historical statistics.

What carries the argument

Foreseeable routing information from the rollout stage, which proactively guides the hierarchical planner and transfer engine for micro-step expert load balancing.

If this is right

  • Load balancing operates at micro-step granularity instead of step-level granularity.
  • The system achieves up to 1.45× speedup over state-of-the-art RL post-training systems on 64 GPUs.
  • Frequent per-micro-step reconfiguration becomes feasible through decomposed planning and overlapped transfers.
  • Balancing no longer depends on historical step-level statistics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The cross-stage foresight pattern could extend to other multi-stage training or serving pipelines where early routing or scheduling decisions inform later resource allocation.
  • Similar techniques might reduce reliance on complex historical modeling when workloads exhibit stable coarse-grained but variable fine-grained behavior.
  • The approach suggests testing whether routing foresight improves balancing in non-RL MoE settings with comparable pipeline structure.

Load-bearing premise

Routing decisions produced during the rollout stage remain sufficiently accurate and low-overhead to usefully guide expert placement and transfer decisions in the subsequent recompute and policy update stages.

What would settle it

A workload where rollout-stage routing decisions diverge substantially from actual loads in later stages, producing higher imbalance or slower training than historical-statistic baselines on the same RL post-training pipeline.

Figures

Figures reproduced from arXiv: 2606.11867 by Bin Cui, Fangcheng Fu, Haoyang Li, Jie Jiang, Sheng Lin, Tong Zhao, Xupeng Miao, Yanfeng Zhao, Yuming Zhou.

Figure 1
Figure 1. Figure 1: MoE layer with expert parallelism (EP). Load imbalance arises because different tokens are routed to different experts. popular experts, leaving others underutilized. This signifi￾cantly degrades overall training performance. Prior efforts to alleviate expert load imbalance largely cen￾ter on the pre-training phase [2, 13, 30, 39, 56, 67, 68, 70]. As [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Expert relocation and replication mechanisms. (a) Origi￾nal placement with severe load imbalance. (b) After jointly applying expert relocation (e.g., E3 and E4) and expert replication (e.g., E2’ and E5’), perfect load balance is achieved. 2 Preliminaries In this section, we provide background on the MoE architec￾ture, examine its load balancing issues, and present the RL post-training pipeline for MoE mode… view at source ↗
Figure 3
Figure 3. Figure 3: Comparison of (a) a pre-training step and (b) an RL post-training step for MoE models. (a) During MoE pre-training, routing information is unavailable before execution, requiring existing approaches to predict routing behavior from historical statistics. (b) An RL post-training step consists of rollout, recompute, and policy update stages. The rollout stage generates responses and records routing informati… view at source ↗
Figure 4
Figure 4. Figure 4: Expert load characteristics during RL post-training on Qwen3-30B-A3B. Step-level expert load remains stable but skewed, while micro-step-level expert load exhibits substantial fluctuations. where different experts become specialized in distinct linguis￾tic patterns [22] or knowledge domains [61]. In contrast, RL post-training is typically performed on more concentrated tasks (e.g., mathematics [33, 60] or … view at source ↗
Figure 5
Figure 5. Figure 5: Overview of ForeMoE. The Rollout Collector on each rollout worker collects routing information and feeds it to the Four￾stage Planner (§8). For each micro-step, the planner determines the optimal expert placement and token assignment. The Expert Transfer Engine (§6) then reconfigures expert placement as needed. This decomposition preserves solving quality while substan￾tially reducing the solving overhead.… view at source ↗
Figure 6
Figure 6. Figure 6: Two expert transfer paths. (a) CPU-assisted path: each machine maintains a full copy of expert weights in pinned CPU memory. GPUs prefetch the experts required by the next micro￾step via PCIe. (b) GPU-direct path: expert weights reside on GPUs. Reconfiguration is performed through GPU-to-GPU transfers. subproblems and sequentially performs (1) base expert place￾ment, (2) expert relocation, (3) expert repli… view at source ↗
Figure 7
Figure 7. Figure 7: (a) Micro-step-level reconfiguration for the policy update stage, which includes both forward and backward passes. The forward and backward passes of the same micro-step use the same reconfiguration plan. (b) During both the forward and backward passes, per-layer expert transfer is overlapped with the execution on the main stream through a three-stage procedure. overhead can no longer be easily amortized. … view at source ↗
Figure 8
Figure 8. Figure 8: End-to-end per-step latency across six configurations (a)–(f). Each bar stacks the recompute stage over the policy update stage. Numbers above the bars denote the end-to-end speedup over veRL. (§10.2). We then decompose this speedup to isolate the con￾tributions of individual planning stages and transfer-path choices (§10.3). Next, we study how effectively ForeMoE reshapes rank load and inter-machine traff… view at source ↗
Figure 9
Figure 9. Figure 9: B, L, P, and T denote base expert placement, expert relocation, expert replication, and token assignment, respectively. The end-to-end speedup over veRL is shown on top [PITH_FULL_IMAGE:figures/full_fig_p011_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Per-step distribution of (a) the compute imbalance ratio and (b) the maximum inter-machine link traffic over 10 steps. 10.5 Overhead Analysis In this section, we analyze the runtime overhead introduced by ForeMoE’s planning (§8) and expert transfer (§6). Comparison with ForeMoE-opt. We first compare Fore￾MoE with ForeMoE-opt, an idealized variant that executes planning and expert transfer offline, excludi… view at source ↗
Figure 12
Figure 12. Figure 12: (a) Per-step planning time versus stage time. “Rec.” denotes recompute and “Upd.” denotes policy update. (b) Per-layer expert transfer time versus attention time. through expert-workload chunking [53, 54]. Comet [69] fur￾ther develops GPU kernel fusion via thread-block specializa￾tion. These techniques are orthogonal to ForeMoE, as they optimize MoE execution itself and can be directly combined with our l… view at source ↗
read the original abstract

Mixture-of-Experts (MoE) and reinforcement learning (RL) post-training now dominate large language model (LLM) development, yet expert load imbalance remains a critical challenge. Existing load-balancing systems target pre-training by relying on historical step-level statistics. However, these methods fail under the unique workload dynamics of RL post-training: the step-level load is stable, but the tiny batch sizes processed during micro-steps cause severe, high-frequency load fluctuations. We introduce ForeMoE, a micro-step-level load balancing system for MoE RL post-training. Instead of relying on historical statistics, ForeMoE exploits the multi-stage RL pipeline (rollout, recompute, policy update) by using foreseeable routing information from the rollout stage to proactively guide load balancing in the remaining stages. To support frequent per-micro-step reconfiguration, ForeMoE employs a hierarchical planner that decomposes the NP-hard load balancing problem into tractable sub-components, alongside a transfer engine that leverages complementary hardware paths (CPU-assisted and GPU-direct) for overlapped expert transfer. Evaluations on 64 GPUs demonstrate that ForeMoE achieves up to a 1.45$\times$ speedup over state-of-the-art RL post-training systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces ForeMoE, a micro-step-level load balancing system for MoE models during RL post-training. It exploits the multi-stage pipeline (rollout, recompute, policy update) by using routing decisions from the rollout stage to proactively configure expert placement and transfers in later stages, employing a hierarchical planner to decompose the NP-hard balancing problem and a transfer engine that overlaps CPU-assisted and GPU-direct paths. On 64 GPUs it reports up to 1.45× speedup over existing RL post-training systems.

Significance. If the routing-foresight assumption holds and the reported speedup is reproducible, the work would provide a concrete systems technique for handling high-frequency load imbalance that is characteristic of RL post-training but not pre-training, potentially improving training throughput for large MoE models without requiring changes to the RL algorithm itself.

major comments (2)
  1. [Abstract] Abstract: the 1.45× speedup is stated without any description of the baselines, workloads, number of runs, or variance; this information is required to assess whether the result supports the central claim.
  2. [Abstract] Abstract: no routing-overlap statistics, load-prediction error, or ablation that isolates the foresight component are supplied, leaving the key assumption—that rollout-stage assignments remain sufficiently stable for recompute and update micro-steps—unsupported despite being load-bearing for the performance argument under the described tiny-batch, high-frequency regime.
minor comments (1)
  1. [Abstract] The abstract introduces the terms 'hierarchical planner' and 'transfer engine' without a one-sentence gloss or pointer to the section that defines them.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and indicate planned revisions to the abstract and manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the 1.45× speedup is stated without any description of the baselines, workloads, number of runs, or variance; this information is required to assess whether the result supports the central claim.

    Authors: We agree that the abstract would benefit from additional context. In the revision we will update the abstract to briefly identify the baselines as state-of-the-art RL post-training systems that rely on historical step-level statistics, note the workloads as tiny-batch MoE RL post-training on 64 GPUs, and state that the 1.45× figure is the maximum observed speedup with full run counts and variance reported in the evaluation section. revision: yes

  2. Referee: [Abstract] Abstract: no routing-overlap statistics, load-prediction error, or ablation that isolates the foresight component are supplied, leaving the key assumption—that rollout-stage assignments remain sufficiently stable for recompute and update micro-steps—unsupported despite being load-bearing for the performance argument under the described tiny-batch, high-frequency regime.

    Authors: We acknowledge the abstract itself does not contain these supporting figures. The manuscript's evaluation section demonstrates the end-to-end benefit of rollout-stage foresight via the reported speedups. We will revise the abstract to reference the observed stability of routing decisions across pipeline stages and will ensure the body includes explicit routing-overlap statistics, load-prediction error measurements, and an ablation isolating the foresight component. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical systems claim with no self-referential derivation or fitting

full rationale

The paper presents ForeMoE as a systems artifact that exploits rollout-stage routing to guide later pipeline stages, evaluated empirically on 64 GPUs for a 1.45× speedup. No equations, fitted parameters, or mathematical derivations are described in the abstract or claimed chain; the central result is an end-to-end performance measurement rather than a prediction derived from its own inputs. The stability assumption noted by the skeptic is an empirical precondition, not a self-definition or fitted-input reduction. No self-citations are invoked as load-bearing uniqueness theorems. This is a standard non-circular empirical systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

This is an applied systems paper; the central claim rests on the introduction of two new software components (hierarchical planner and transfer engine) rather than on mathematical free parameters or axioms.

invented entities (2)
  • hierarchical planner no independent evidence
    purpose: decompose the NP-hard load balancing problem into tractable sub-components
    New component introduced to support frequent per-micro-step reconfiguration
  • transfer engine no independent evidence
    purpose: leverage complementary hardware paths (CPU-assisted and GPU-direct) for overlapped expert transfer
    New component introduced to support frequent per-micro-step reconfiguration

pith-pipeline@v0.9.1-grok · 5771 in / 1174 out tokens · 26982 ms · 2026-06-27T08:33:26.506323+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

76 extracted references · 13 canonical work pages

  1. [1]

    Deepep: A high-performance communication library, 2025.https: //github.com/deepseek-ai/DeepEP

  2. [2]

    Expert parallelism load balancer, 2025.https://github.com/deepseek- ai/EPLB

  3. [3]

    A Survey on Mixture of Experts in Large Language Models , ISSN=

    Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. A survey on mixture of experts in large language models. IEEE Transactions on Knowledge and Data Engineering, 2025.http: //dx.doi.org/10.1109/TKDE.2025.3554028

  4. [4]

    Respec: Towards optimizing speculative decoding in reinforcement learning systems, 2025.https://arxiv.org/abs/2510.26475

    Qiaoling Chen, Zijun Liu, Peng Sun, Shenggui Li, Guoteng Wang, Ziming Liu, Yonggang Wen, Siyuan Feng, and Tianwei Zhang. Respec: Towards optimizing speculative decoding in reinforcement learning systems, 2025.https://arxiv.org/abs/2510.26475

  5. [5]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025.https: //arxiv.org/abs/2501.12948

    DeepSeek-AI, Daya Guo, Dejian Yang, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025.https: //arxiv.org/abs/2501.12948

  6. [6]

    Deepseek-v3 technical report, 2025.https://arxiv.org/abs/2412.19437

    DeepSeek-AI, Aixin Liu, Bei Feng, et al. Deepseek-v3 technical report, 2025.https://arxiv.org/abs/2412.19437

  7. [7]

    GLaM: Efficient scaling of language models with mixture-of-experts

    Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lep- ikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P Bosma, Zongwei Zhou, Tao Wang, Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc Le, Yonghui Wu, Zhifeng Chen, and...

  8. [8]

    Dapo-math-17k dataset, 2025.https://huggingface.co/ datasets/BytedTsinghua-SIA/DAPO-Math-17k

    Hugging Face. Dapo-math-17k dataset, 2025.https://huggingface.co/ datasets/BytedTsinghua-SIA/DAPO-Math-17k

  9. [9]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 2022.http://jmlr.org/papers/ v23/21-0998.html

  10. [10]

    Areal: A large-scale asynchronous reinforcement learning system for language reasoning, 2025.https://arxiv.org/abs/ 2505.24298

    Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, and Yi Wu. Areal: A large-scale asynchronous reinforcement learning system for language reasoning, 2025.https://arxiv.org/abs/ 2505.24298

  11. [11]

    R. L. Graham. Bounds on multiprocessing timing anomalies.SIAM J. Appl. Math., 1969.https://doi.org/10.1137/0117039

  12. [12]

    Asyncflow: An asynchronous streaming rl framework for efficient llm post-training, 2025.https://arxiv.org/abs/2507.01663

    Zhenyu Han, Ansheng You, Haibo Wang, Kui Luo, Guang Yang, Wenqi Shi, Menglong Chen, Sicheng Zhang, Zeshun Lan, Chunshi Deng, Huazhong Ji, Wenjie Liu, Yu Huang, Yixiang Zhang, Chenyi Pan, Jing Wang, Xin Huang, Chunsheng Li, and Jianping Wu. Asyncflow: An asynchronous streaming rl framework for efficient llm post-training, 2025.https://arxiv.org/abs/2507.01663

  13. [13]

    In: Proceed- ings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp

    Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, and Qin Li. Fastermoe: modeling and optimizing training of large-scale dynamic pre-trained models. InProceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2022.https://doi.org/10.1145/3503221.3508418

  14. [14]

    History rhymes: Accelerating llm reinforcement learning with rhymerl, 2025.https://arxiv.org/abs/2508.18588

    Jingkai He, Tianjian Li, Erhu Feng, Dong Du, Qian Liu, Tao Liu, Yubin Xia, and Haibo Chen. History rhymes: Accelerating llm reinforcement learning with rhymerl, 2025.https://arxiv.org/abs/2508.18588

  15. [15]

    Openrlhf: An easy-to-use, scalable and high-performance rlhf framework, 2025

    Jian Hu, Xibin Wu, Wei Shen, Jason Klein Liu, Zilin Zhu, Weixun Wang, Songlin Jiang, Haoran Wang, Hao Chen, Bin Chen, Weikai Fang, Xianyu, Yu Cao, Haotian Xu, and Yiming Liu. Openrlhf: An easy-to-use, scalable and high-performance rlhf framework, 2025. https://arxiv.org/abs/2405.11143

  16. [16]

    Demystifying nccl: An in-depth analysis of gpu communication protocols and algorithms, 2025.https://arxiv.org/abs/2507.04786

    Zhiyi Hu, Siyuan Shen, Tommaso Bonato, Sylvain Jeaugey, Cedell Alexander, Eric Spada, James Dinan, Jeff Hammond, and Torsten Hoe- fler. Demystifying nccl: An in-depth analysis of gpu communication protocols and algorithms, 2025.https://arxiv.org/abs/2507.04786

  17. [17]

    Qerl: Beyond efficiency – quantization-enhanced reinforcement learning for llms, 2025.https: //arxiv.org/abs/2510.11696

    Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xi- aojuan Qi, Song Han, and Yukang Chen. Qerl: Beyond efficiency – quantization-enhanced reinforcement learning for llms, 2025.https: //arxiv.org/abs/2510.11696

  18. [18]

    and Hall, J

    Q. Huangfu and J. A. J. Hall. Parallelizing the dual revised simplex method.Mathematical Programming Computation, 2018.https://doi. org/10.1007/s12532-017-0130-5. 13

  19. [19]

    Tutel: Adap- tive mixture-of-experts at scale

    Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, HoYuen Chau, Peng Cheng, Fan Yang, Mao Yang, and Yongqiang Xiong. Tutel: Adap- tive mixture-of-experts at scale. InProceedings of Machine Learning and Systems, 2023.https://proceedings.mlsys.org/paper_files/paper/ 2023/file/5616d34cf8ff739...

  20. [20]

    Coderl+: Improving code genera- tion via reinforcement with execution semantics alignment, 2026

    Xue Jiang, Yihong Dong, Mengyang Liu, Hongyi Deng, Tian Wang, Yongding Tao, Rongyu Cao, Binhua Li, Zhi Jin, Wenpin Jiao, Fei Huang, Yongbin Li, and Ge Li. Coderl+: Improving code genera- tion via reinforcement with execution semantics alignment, 2026. https://arxiv.org/abs/2510.18471

  21. [21]

    Relibra: Routing- replay-guided load balancing for moe training in reinforcement learn- ing, 2026.https://arxiv.org/abs/2605.08639

    Chao Jin, Xinming Wei, Yinmin Zhong, Chengxu Yang, Bingyang Wu, Ruidong Zhu, Zili Zhang, Yuliang Liu, and Xin Jin. Relibra: Routing- replay-guided load balancing for moe training in reinforcement learn- ing, 2026.https://arxiv.org/abs/2605.08639

  22. [22]

    Towards democratizing LLMs: Investigating mul- tilingual mixture-of-experts models

    Aditi Khandelwal, Marius Mosbach, Verna Dankers, Siva Reddy, and Golnoosh Farnadi. Towards democratizing LLMs: Investigating mul- tilingual mixture-of-experts models. InWomen in Machine Learning Workshop @ NeurIPS 2025, 2026.https://openreview.net/forum?id= Bwf4grCk3H

  23. [23]

    <constraint text>

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th Symposium on Operating Systems Principles, 2023.https://doi.org/10.1145/3600006.3613165

  24. [24]

    {GS}hard: Scaling giant models with conditional com- putation and automatic sharding

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. {GS}hard: Scaling giant models with conditional com- putation and automatic sharding. InInternational Conference on Learning Representations, 2021.https://openreview.net/forum?id= qrwe7XHTmYb

  25. [25]

    Base layers: Simplifying training of large, sparse models

    Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. Base layers: Simplifying training of large, sparse models. InProceedings of the 38th International Conference on Machine Learning, 2021.https://proceedings.mlr.press/v139/lewis21a.html

  26. [26]

    Unleashing effi- cient asynchronous rl post-training via staleness-constrained rollout coordination, 2026.https://arxiv.org/abs/2601.12784

    Haoyang Li, Sheng Lin, Fangcheng Fu, Yuming Zhou, Xiaodong Ji, Yanfeng Zhao, Lefeng Wang, Jie Jiang, and Bin Cui. Unleashing effi- cient asynchronous rl post-training via staleness-constrained rollout coordination, 2026.https://arxiv.org/abs/2601.12784

  27. [27]

    Spec-rl: Accel- erating on-policy reinforcement learning with speculative rollouts, 2026.https://arxiv.org/abs/2509.23232

    Bingshuai Liu, Ante Wang, Zijun Min, Liang Yao, Haibo Zhang, Yang Liu, Xu Han, Peng Li, Anxiang Zeng, and Jinsong Su. Spec-rl: Accel- erating on-policy reinforcement learning with speculative rollouts, 2026.https://arxiv.org/abs/2509.23232

  28. [28]

    Jiacai Liu, Yingru Li, Yuqian Fu, Jiawei Wang, Qian Liu, and Yu Shen. When speed kills stability: Demystifying rl collapse from the inference- training mismatch, 2025.https://yingru.notion.site/When-Speed- Kills-Stability-Demystifying-RL-Collapse-from-the-Inference- Training-Mismatch-271211a558b7808d8b12d403fd15edda

  29. [29]

    Flashrl: 8bit rollouts, full power rl, 2025.https: //fengyao.notion.site/flash-rl

    Liyuan Liu, Feng Yao, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, and Jianfeng Gao. Flashrl: 8bit rollouts, full power rl, 2025.https: //fengyao.notion.site/flash-rl

  30. [30]

    Laer-moe: Load-adaptive expert re-layout for efficient mixture-of-experts training

    Xinyi Liu, Yujie Wang, Fangcheng Fu, Xuefeng Xiao, Huixia Li, Ji- ashi Li, and Bin Cui. Laer-moe: Load-adaptive expert re-layout for efficient mixture-of-experts training. InProceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2026.https://doi.org/10. 1145/3779212.3790180

  31. [31]

    Part ii: Roll flash – accelerating rlvr and agentic training with asynchrony, 2025.https://arxiv.org/abs/2510.11345

    Han Lu, Zichen Liu, Shaopan Xiong, Yancheng He, Wei Gao, Yanan Wu, Weixun Wang, Jiashun Liu, Yang Li, Haizhou Zhao, Ju Huang, Siran Yang, Xiaoyang Li, Yijia Luo, Zihe Liu, Ling Pan, Junchi Yan, Wei Wang, Wenbo Su, Jiamang Wang, Lin Qu, and Bo Zheng. Part ii: Roll flash – accelerating rlvr and agentic training with asynchrony, 2025.https://arxiv.org/abs/2510.11345

  32. [32]

    Stabilizing moe reinforcement learning by aligning training and inference routers, 2025.https://arxiv.org/abs/ 2510.11370

    Wenhan Ma, Hailin Zhang, Liang Zhao, Yifan Song, Yudong Wang, Zhifang Sui, and Fuli Luo. Stabilizing moe reinforcement learning by aligning training and inference routers, 2025.https://arxiv.org/abs/ 2510.11370

  33. [33]

    Math-beyond: A benchmark for rl to expand beyond the base model, 2025.https://arxiv.org/abs/2510.11653

    Prasanna Mayilvahanan, Ricardo Dominguez-Olmedo, Thaddäus Wiedemer, and Wieland Brendel. Math-beyond: A benchmark for rl to expand beyond the base model, 2025.https://arxiv.org/abs/2510.11653

  34. [34]

    Jordan, and Ion Stoica

    Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. Ray: a distributed framework for emerging ai applications. InProceedings of the 13th USENIX Conference on Operating Systems Design and Implementation, 2018.https://dl.acm. org/doi/10.5555...

  35. [35]

    A comprehensive survey of mixture-of- experts: Algorithms, theory, and applications, 2025.https://arxiv.org/ abs/2503.07137

    Siyuan Mu and Sen Lin. A comprehensive survey of mixture-of- experts: Algorithms, theory, and applications, 2025.https://arxiv.org/ abs/2503.07137

  36. [36]

    Pipedream: generalized pipeline parallelism for dnn training,

    Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. Pipedream: generalized pipeline parallelism for dnn train- ing. InProceedings of the 27th ACM Symposium on Operating Systems Principles, 2019.https://doi.org/10.1145/3341301.3359646

  37. [37]

    Memory-efficient pipeline-parallel dnn training

    Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. Memory-efficient pipeline-parallel dnn training. In Proceedings of the 38th International Conference on Machine Learning, 2021.https://proceedings.mlr.press/v139/narayanan21a.html

  38. [38]

    Efficient large-scale language model training on gpu clusters using megatron-lm,

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGres- ley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on gpu clusters using megatron-lm. InProceedings of the International Conference for High Perfo...

  39. [39]

    Flexmoe: Scaling large-scale sparse pre-trained model training via dynamic device placement.Proc

    Xiaonan Nie, Xupeng Miao, Zilong Wang, Zichao Yang, Jilong Xue, Lingxiao Ma, Gang Cao, and Bin Cui. Flexmoe: Scaling large-scale sparse pre-trained model training via dynamic device placement.Proc. ACM Manag. Data, 2023.https://doi.org/10.1145/3588964

  40. [40]

    Asynchronous rlhf: Faster and more efficient off-policy rl for language models, 2025

    Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hos- seini, Rishabh Agarwal, and Aaron Courville. Asynchronous rlhf: Faster and more efficient off-policy rl for language models, 2025. https://arxiv.org/abs/2410.18252

  41. [41]

    Nvidia collective communication library (nccl) documen- tation, 2025.https://docs.nvidia.com/deeplearning/nccl/user-guide/ docs/index.html

    NVIDIA. Nvidia collective communication library (nccl) documen- tation, 2025.https://docs.nvidia.com/deeplearning/nccl/user-guide/ docs/index.html

  42. [42]

    Codeforces, 2025.https://huggingface.co/datasets/open-r1/codeforces

    Guilherme Penedo, Anton Lozhkov, Hynek Kydlíček, Loubna Ben Allal, Edward Beeching, Agustín Piqueres Lajarín, Quentin Gallouédec, Nathan Habib, Lewis Tunstall, and Leandro von Werra. Codeforces, 2025.https://huggingface.co/datasets/open-r1/codeforces

  43. [43]

    Seer: Online context learning for fast synchronous llm reinforcement learning, 2025.https://arxiv.org/abs/2511.14617

    Ruoyu Qin, Weiran He, Weixiao Huang, Yangkun Zhang, Yikai Zhao, Bo Pang, Xinran Xu, Yingdi Shan, Yongwei Wu, and Mingxing Zhang. Seer: Online context learning for fast synchronous llm reinforcement learning, 2025.https://arxiv.org/abs/2511.14617

  44. [44]

    Qwen2.5 technical report, 2025.https://arxiv.org/abs/2412.15115

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, et al. Qwen2.5 technical report, 2025.https://arxiv.org/abs/2412.15115

  45. [45]

    Tyrell Rockafellar.Convex Analysis

    R. Tyrell Rockafellar.Convex Analysis. Princeton University Press, 1970.http://www.jstor.org/stable/j.ctt14bs1ff

  46. [46]

    Hash layers for large sparse models

    Stephen Roller, Sainbayar Sukhbaatar, arthur szlam, and Jason Weston. Hash layers for large sparse models. InAdvances in Neural Information Processing Systems, 2021.https://proceedings.neurips.cc/paper_files/ paper/2021/file/92bf5e6240737e0326ea59846a83e076-Paper.pdf

  47. [47]

    Proximal policy optimization algorithms, 2017.https: //arxiv.org/abs/1707.06347

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017.https: //arxiv.org/abs/1707.06347

  48. [48]

    Beat the long tail: Distribution-aware speculative decoding for rl training, 2025.https://arxiv.org/abs/2511.13841

    Zelei Shao, Vikranth Srivatsa, Sanjana Srivastava, Qingyang Wu, Al- pay Ariyak, Xiaoxia Wu, Ameen Patel, Jue Wang, Percy Liang, Tri 14 Dao, Ce Zhang, Yiying Zhang, Ben Athiwaratkun, Chenfeng Xu, and Junxiong Wang. Beat the long tail: Distribution-aware speculative decoding for rl training, 2025.https://arxiv.org/abs/2511.13841

  49. [49]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024.https://arxiv.org/abs/2402.03300

  50. [50]

    Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, 2017.https: //arxiv.org/abs/1701.06538

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, 2017.https: //arxiv.org/abs/1701.06538

  51. [51]

    Laminar: A scalable asynchronous rl post-training framework, 2025.https://arxiv.org/abs/2510.12633

    Guangming Sheng, Yuxuan Tong, Borui Wan, Wang Zhang, Chaobo Jia, Xibin Wu, Yuqi Wu, Xiang Li, Chi Zhang, Yanghua Peng, Haibin Lin, Xin Liu, and Chuan Wu. Laminar: A scalable asynchronous rl post-training framework, 2025.https://arxiv.org/abs/2510.12633

  52. [52]

    Hybridflow: A flexible and efficient rlhf framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. InProceedings of the Twentieth European Conference on Computer Systems, 2025.http://dx.doi.org/10. 1145/3689031.3696075

  53. [53]

    Anycostfl: Efficient on-demand federated learning over heterogeneous edge devices

    Shaohuai Shi, Xinglin Pan, Xiaowen Chu, and Bo Li. Pipemoe: Ac- celerating mixture-of-experts through adaptive pipelining. InIEEE INFOCOM 2023 - IEEE Conference on Computer Communications, 2023. https://doi.org/10.1109/INFOCOM53939.2023.10228874

  54. [54]

    Donaldson, John Wickerson, and Manuel Rigger

    Shaohuai Shi, Xinglin Pan, Qiang Wang, Chengjian Liu, Xiaozhe Ren, Zhongzhe Hu, Yu Yang, Bo Li, and Xiaowen Chu. Schemoe: An ex- tensible mixture-of-experts distributed training system with tasks scheduling. InProceedings of the Nineteenth European Conference on Computer Systems, 2024.https://doi.org/10.1145/3627703.3650083

  55. [55]

    Megatron-lm: Training multi- billion parameter language models using model parallelism, 2020

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi- billion parameter language models using model parallelism, 2020. https://arxiv.org/abs/1909.08053

  56. [56]

    SYMI: Efficient Mixture-of-Experts training via model and optimizer state decoupling

    Athinagoras Skiadopoulos, Mark Zhao, Swapnil Gandhi, Thomas Norrie, Shrijeet Mukherjee, and Christos Kozyrakis. SYMI: Efficient Mixture-of-Experts training via model and optimizer state decoupling. In23rd USENIX Symposium on Networked Systems Design and Imple- mentation (NSDI 26), 2026.https://www.usenix.org/conference/nsdi26/ presentation/skiadopoulos

  57. [57]

    Kimi k2.5: Visual agentic intelligence, 2026.https://arxiv.org/abs/2602.02276

    Kimi Team, Tongtong Bai, Yifan Bai, et al. Kimi k2.5: Visual agentic intelligence, 2026.https://arxiv.org/abs/2602.02276

  58. [58]

    Kimi k2: Open agentic intelli- gence, 2025.https://arxiv.org/abs/2507.20534

    Kimi Team, Yifan Bai, Yiping Bao, et al. Kimi k2: Open agentic intelli- gence, 2025.https://arxiv.org/abs/2507.20534

  59. [59]

    Philippe Tillet, H. T. Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. InPro- ceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, 2019.https://doi.org/10.1145/ 3315508.3329973

  60. [60]

    A survey on large language models for mathematical reasoning.ACM Comput

    Peng-Yuan Wang, Tian-Shuo Liu, Chenyang Wang, Ziniu Li, Yidi Wang, Shu Yan, Chengxing Jia, Xu-Hui Liu, Xinwei Chen, Jiacheng Xu, and Yang Yu. A survey on large language models for mathematical reasoning.ACM Comput. Surv., 2026.https://dl.acm.org/doi/10.1145/ 3786333

  61. [61]

    The myth of expert specialization in moes: Why routing reflects geometry, not necessarily domain expertise, 2026.https://arxiv.org/abs/2604.09780

    Xi Wang, Soufiane Hayou, and Eric Nalisnick. The myth of expert specialization in moes: Why routing reflects geometry, not necessarily domain expertise, 2026.https://arxiv.org/abs/2604.09780

  62. [62]

    Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, and Sida I. Wang. Swe-rl: Advancing llm reasoning via reinforcement learning on open software evolution, 2025.https://arxiv.org/abs/2502. 18449

  63. [63]

    Llamarl: A distributed asynchronous reinforcement learning framework for efficient large- scale llm training, 2025.https://arxiv.org/abs/2505.24034

    Bo Wu, Sid Wang, Yunhao Tang, Jia Ding, Eryk Helenowski, Liang Tan, Tengyu Xu, Tushar Gowda, Zhengxing Chen, Chen Zhu, Xiaocheng Tang, Yundi Qian, Beibei Zhu, and Rui Hou. Llamarl: A distributed asynchronous reinforcement learning framework for efficient large- scale llm training, 2025.https://arxiv.org/abs/2505.24034

  64. [64]

    Qwen3 technical report, 2025.https://arxiv.org/abs/2505.09388

    An Yang, Anfeng Li, Baosong Yang, et al. Qwen3 technical report, 2025.https://arxiv.org/abs/2505.09388

  65. [65]

    Your efficient rl framework secretly brings you off-policy rl training, 2025.https://fengyao.notion.site/off-policy-rl

    Feng Yao, Liyuan Liu, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, and Jianfeng Gao. Your efficient rl framework secretly brings you off-policy rl training, 2025.https://fengyao.notion.site/off-policy-rl

  66. [66]

    Dapo: An open-source llm reinforcement learning system at scale, 2025.https://arxiv.org/ abs/2503.14476

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  67. [67]

    SmartMoE: Efficiently training Sparsely-Activated models through combining offline and online parallelization

    Mingshu Zhai, Jiaao He, Zixuan Ma, Zan Zong, Runqing Zhang, and Jidong Zhai. SmartMoE: Efficiently training Sparsely-Activated models through combining offline and online parallelization. In2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023.https://www. usenix.org/conference/atc23/presentation/zhai

  68. [68]

    PopFetcher: Towards ac- celerated Mixture-of-Experts training via popularity based Expert- Wise prefetch

    Junyi Zhang, Chuanhu Ma, Xiong Wang, Yuntao Nie, Yuqing Li, Yue- dong Xu, Xiaofei Liao, Bo Li, and Hai Jin. PopFetcher: Towards ac- celerated Mixture-of-Experts training via popularity based Expert- Wise prefetch. In2025 USENIX Annual Technical Conference (USENIX ATC 25), 2025.https://www.usenix.org/conference/atc25/presentation/ zhang-junyi

  69. [69]

    Comet: Fine- grained computation-communication overlapping for mixture-of- experts

    Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wen- lei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, and Xin Liu. Comet: Fine- grained computation-communication overlapping for mixture-of- experts. InProceedings of Machine Learning and Systems, 2025.https://proceedings.mlsys.org/paper_files/paper/2025/file/ e27ea0...

  70. [70]

    Fine-grained moe load balancing with linear programming, 2026.https: //arxiv.org/abs/2511.16947

    Chenqi Zhao, Wenfei Wu, Linhai Song, Yuchen Xu, and Yitao Yuan. Fine-grained moe load balancing with linear programming, 2026.https: //arxiv.org/abs/2511.16947

  71. [71]

    Small leak can sink a great ship–boost rl training on moe with icepop!, 2025.https://ringtech.notion.site/icepop

    Xin Zhao, Yongkang Liu, Kuan Xu, Jia Guo, Zihao Wang, Yan Sun, Xinyu Kong, Qianggang Cao, Liang Jiang, Zujie Wen, Zhiqiang Zhang, and Jun Zhou. Small leak can sink a great ship–boost rl training on moe with icepop!, 2025.https://ringtech.notion.site/icepop

  72. [72]

    doi: 10.14778/3611540.3611569.https: //doi.org/10.14778/3611540.3611569

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. Pytorch fsdp: Experiences on scaling fully sharded data parallel.Proc. VLDB Endow., 2023.https://doi.org/10.147...

  73. [73]

    Stabilizing reinforcement learning with llms: Formulation and practices, 2025.https://arxiv.org/abs/2512

    Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang, Jun- rong Lin, Yuqiong Liu, Hao Lin, Chencan Wu, Feng Hu, An Yang, Jingren Zhou, and Junyang Lin. Stabilizing reinforcement learning with llms: Formulation and practices, 2025.https://arxiv.org/abs/2512. 01374

  74. [74]

    Gonzalez, Clark Barrett, and Ying Sheng

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. Sglang: efficient execution of structured language model programs. InProceedings of the 38th International Conference on Neural Information Processing Systems, 2024.https://dl.acm.o...

  75. [75]

    Streamrl: Scalable, heterogeneous, and elastic rl for llms with disaggregated stream generation, 2025.https://arxiv.org/abs/2504.15930

    Yinmin Zhong, Zili Zhang, Xiaoniu Song, Hanpeng Hu, Chao Jin, Bingyang Wu, Nuo Chen, Yukun Chen, Yu Zhou, Changyi Wan, Hongyu Zhou, Yimin Jiang, Yibo Zhu, and Daxin Jiang. Streamrl: Scalable, heterogeneous, and elastic rl for llms with disaggregated stream generation, 2025.https://arxiv.org/abs/2504.15930. 15

  76. [76]

    the gradient of 𝑒

    Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, zhifeng Chen, Quoc V Le, and James Laudon. Mixture-of-experts with expert choice rout- ing. InAdvances in Neural Information Processing Systems, 2022.https://proceedings.neurips.cc/paper_files/paper/2022/file/ 2f00ecd787b432c1d36f3de9800728eb-Paper-Conference.pdf. 16 A De...