Harnessing Routing Foresight for Micro-step-level MoE load balancing in RL Post-training

Bin Cui; Fangcheng Fu; Haoyang Li; Jie Jiang; Sheng Lin; Tong Zhao; Xupeng Miao; Yanfeng Zhao; Yuming Zhou

arxiv: 2606.11867 · v1 · pith:6V524WSHnew · submitted 2026-06-10 · 💻 cs.DC

Pith reviewed 2026-06-27 08:33 UTC · model grok-4.3

keywords: Mixture-of-Experts · Reinforcement Learning Post-training · Load Balancing · Expert Routing · Distributed Training · Micro-step Scheduling · GPU Clusters

The pith

ForeMoE uses routing foresight from the rollout stage to balance MoE experts at micro-step level in RL post-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mixture-of-Experts models during reinforcement learning post-training face expert load imbalance because tiny batch sizes create severe high-frequency fluctuations at the micro-step level even when overall step load stays stable. Prior load-balancing systems that rely on historical statistics cannot keep up with these rapid changes. ForeMoE instead extracts routing decisions made in the rollout stage and uses them to plan expert placement and transfers for the recompute and policy update stages. A hierarchical planner decomposes the complex balancing task into smaller pieces while a transfer engine overlaps moves across CPU-assisted and GPU-direct paths to support frequent reconfigurations. Evaluations on 64 GPUs show this yields up to 1.45 times faster end-to-end training than existing RL post-training systems.
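
The gap between step-level and micro-step-level load can be made concrete with a small sketch. The routing table below is hypothetical (it is not data from the paper), `expert_load` and `imbalance_ratio` are illustrative helper names, and max-over-mean is assumed here as the imbalance measure rather than taken from the paper.

```python
from collections import Counter

def expert_load(token_routes, num_experts):
    """Count tokens per expert; token_routes holds one list of expert ids per token."""
    counts = Counter(e for route in token_routes for e in route)
    return [counts.get(e, 0) for e in range(num_experts)]

def imbalance_ratio(load):
    """Max load divided by mean load; 1.0 means perfectly balanced."""
    mean = sum(load) / len(load)
    return max(load) / mean if mean > 0 else 1.0

# Hypothetical top-1 routing for one step split into two micro-steps, 4 experts.
micro_steps = [
    [[0], [0], [0], [1]],
    [[2], [3], [3], [3]],
]
NUM_EXPERTS = 4

# Imbalance seen by each micro-step on its own.
micro_ratios = [imbalance_ratio(expert_load(ms, NUM_EXPERTS)) for ms in micro_steps]

# Imbalance of the whole step, i.e. all micro-steps aggregated.
all_tokens = [route for ms in micro_steps for route in ms]
step_ratio = imbalance_ratio(expert_load(all_tokens, NUM_EXPERTS))

print("micro-step imbalance:", micro_ratios)  # [3.0, 3.0]
print("step-level imbalance:", step_ratio)    # 1.5
```

In this toy case the aggregated step looks only mildly skewed while each micro-step is heavily skewed, which is the pattern the summary describes.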

Core claim

ForeMoE is a micro-step-level load balancing system for MoE RL post-training that exploits foreseeable routing information from the rollout stage to proactively guide expert placement and transfers in the recompute and policy update stages. It uses a hierarchical planner to decompose the NP-hard balancing problem and a transfer engine that leverages complementary CPU-assisted and GPU-direct hardware paths for overlapped expert movement, enabling per-micro-step reconfiguration without dependence on historical statistics.
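
As a rough illustration of what one placement sub-problem might look like, the sketch below assigns experts to ranks with the classic longest-processing-time-first greedy heuristic. This is a stand-in chosen for brevity, not the paper's hierarchical planner: the function name `lpt_place`, the example loads, and the omission of per-rank capacity limits, replication, and token assignment are all assumptions.

```python
import heapq

def lpt_place(expert_loads, num_ranks):
    """Greedy placement: heaviest expert first, always onto the least-loaded rank.

    expert_loads[e] is the token count foreseen for expert e in one micro-step.
    Returns a dict mapping expert id -> rank id.
    """
    heap = [(0, r) for r in range(num_ranks)]  # (current load, rank id)
    heapq.heapify(heap)
    placement = {}
    # sorted() is stable, so equally loaded experts keep their id order.
    for expert, load in sorted(enumerate(expert_loads), key=lambda x: -x[1]):
        total, rank = heapq.heappop(heap)
        placement[expert] = rank
        heapq.heappush(heap, (total + load, rank))
    return placement

def rank_loads(expert_loads, placement, num_ranks):
    """Sum the foreseen load that lands on each rank under a placement."""
    loads = [0] * num_ranks
    for expert, rank in placement.items():
        loads[rank] += expert_loads[expert]
    return loads

# Hypothetical foreseen loads for 6 experts, to be placed on 2 ranks.
foreseen = [8, 1, 1, 6, 2, 2]
placement = lpt_place(foreseen, num_ranks=2)
print(placement)
print(rank_loads(foreseen, placement, 2))  # [10, 10]
```

A real planner would also have to respect memory limits per GPU and the cost of moving experts, which is presumably why the paper splits the problem into several stages.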

What carries the argument

Foreseeable routing information from the rollout stage, which proactively guides the hierarchical planner and transfer engine for micro-step expert load balancing.

If this is right

Load balancing operates at micro-step granularity instead of step-level granularity.
The system achieves up to 1.45× speedup over state-of-the-art RL post-training systems on 64 GPUs.
Frequent per-micro-step reconfiguration becomes feasible through decomposed planning and overlapped transfers.
Balancing no longer depends on historical step-level statistics.
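
Per-micro-step reconfiguration implies computing, for each micro-step, which experts actually have to move. A minimal sketch of that diff is shown below; the placements are hypothetical, `plan_transfers` is an illustrative name, and the choice between the CPU-assisted and GPU-direct paths is deliberately left out.

```python
def plan_transfers(old_placement, new_placement):
    """List the experts whose rank changes between two consecutive micro-steps.

    Both arguments map expert id -> rank id.
    Returns sorted (expert, source_rank, destination_rank) tuples.
    """
    moves = []
    for expert, dst in new_placement.items():
        src = old_placement.get(expert)
        if src is not None and src != dst:
            moves.append((expert, src, dst))
    return sorted(moves)

# Hypothetical placements for 4 experts on 2 ranks.
current = {0: 0, 1: 0, 2: 1, 3: 1}
planned = {0: 0, 1: 1, 2: 1, 3: 0}

transfers = plan_transfers(current, planned)
print(transfers)  # [(1, 0, 1), (3, 1, 0)]
```

Only the experts in this list need to be transferred; the others stay in place, which keeps the per-micro-step transfer volume bounded by how much the plan changes.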

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The cross-stage foresight pattern could extend to other multi-stage training or serving pipelines where early routing or scheduling decisions inform later resource allocation.
Similar techniques might reduce reliance on complex historical modeling when workloads exhibit stable coarse-grained but variable fine-grained behavior.
The approach suggests testing whether routing foresight improves balancing in non-RL MoE settings with comparable pipeline structure.

Load-bearing premise

Routing decisions produced during the rollout stage remain sufficiently accurate and low-overhead to usefully guide expert placement and transfer decisions in the subsequent recompute and policy update stages.
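
One way to probe this premise is to measure how often the experts chosen during rollout match those chosen when the same tokens are processed again. The sketch below computes such an overlap score; the metric definition, the name `routing_overlap`, and the sample routes are assumptions for illustration, not measurements or definitions from the paper.

```python
def routing_overlap(rollout_routes, later_routes):
    """Fraction of later-stage expert selections that rollout already predicted.

    Each argument is a list with one entry per token; every entry is the
    list of expert ids (top-k) selected for that token.
    """
    matched = 0
    total = 0
    for predicted, actual in zip(rollout_routes, later_routes):
        actual_set = set(actual)
        matched += len(set(predicted) & actual_set)
        total += len(actual_set)
    return matched / total if total else 1.0

# Hypothetical top-2 routes for three tokens.
rollout = [[0, 1], [2, 3], [1, 2]]
recompute = [[0, 1], [2, 4], [1, 2]]

overlap = routing_overlap(rollout, recompute)
print(f"routing overlap: {overlap:.3f}")
```

A score close to 1.0 would support the premise; a low score would mean the planner is working from stale routing information.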

What would settle it

A workload where rollout-stage routing decisions diverge substantially from actual loads in later stages, producing higher imbalance or slower training than historical-statistic baselines on the same RL post-training pipeline.
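
Such a divergence could be quantified as a load-prediction error per micro-step. The following sketch uses a normalized L1 distance between the load predicted from rollout routes and the load actually observed; the metric, the helper names, and the numbers are hypothetical.

```python
def predicted_load(rollout_routes, num_experts):
    """Per-expert token counts implied by rollout-stage routes."""
    load = [0] * num_experts
    for route in rollout_routes:
        for expert in route:
            load[expert] += 1
    return load

def load_prediction_error(predicted, actual):
    """Sum of absolute per-expert differences, normalized by total actual load."""
    total = sum(actual)
    if total == 0:
        return 0.0
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / total

# Hypothetical micro-step: 8 tokens, top-1 routing, 4 experts.
rollout_routes = [[0], [0], [0], [0], [1], [1], [3], [3]]
observed = [3, 2, 1, 2]  # load measured in the later stage

pred = predicted_load(rollout_routes, num_experts=4)
error = load_prediction_error(pred, observed)
print(pred, error)  # [4, 2, 0, 2] 0.25
```

Tracking this error alongside end-to-end latency would show whether slowdowns coincide with poor foresight.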

Figures

Figures reproduced from arXiv: 2606.11867 by Bin Cui, Fangcheng Fu, Haoyang Li, Jie Jiang, Sheng Lin, Tong Zhao, Xupeng Miao, Yanfeng Zhao, Yuming Zhou.

**Figure 1.** MoE layer with expert parallelism (EP). Load imbalance arises because different tokens are routed to different experts.

**Figure 2.** Expert relocation and replication mechanisms. (a) Original placement with severe load imbalance. (b) After jointly applying expert relocation (e.g., E3 and E4) and expert replication (e.g., E2’ and E5’), perfect load balance is achieved.

**Figure 3.** Comparison of (a) a pre-training step and (b) an RL post-training step for MoE models. (a) During MoE pre-training, routing information is unavailable before execution, requiring existing approaches to predict routing behavior from historical statistics. (b) An RL post-training step consists of rollout, recompute, and policy update stages. The rollout stage generates responses and records routing informati…

**Figure 4.** Expert load characteristics during RL post-training on Qwen3-30B-A3B. Step-level expert load remains stable but skewed, while micro-step-level expert load exhibits substantial fluctuations.

**Figure 5.** Overview of ForeMoE. The Rollout Collector on each rollout worker collects routing information and feeds it to the Four-stage Planner (§8). For each micro-step, the planner determines the optimal expert placement and token assignment. The Expert Transfer Engine (§6) then reconfigures expert placement as needed.

**Figure 6.** Two expert transfer paths. (a) CPU-assisted path: each machine maintains a full copy of expert weights in pinned CPU memory. GPUs prefetch the experts required by the next micro-step via PCIe. (b) GPU-direct path: expert weights reside on GPUs. Reconfiguration is performed through GPU-to-GPU transfers.

**Figure 7.** (a) Micro-step-level reconfiguration for the policy update stage, which includes both forward and backward passes. The forward and backward passes of the same micro-step use the same reconfiguration plan. (b) During both the forward and backward passes, per-layer expert transfer is overlapped with the execution on the main stream through a three-stage procedure.

**Figure 8.** End-to-end per-step latency across six configurations (a)–(f). Each bar stacks the recompute stage over the policy update stage. Numbers above the bars denote the end-to-end speedup over veRL.

**Figure 9.** B, L, P, and T denote base expert placement, expert relocation, expert replication, and token assignment, respectively. The end-to-end speedup over veRL is shown on top.

**Figure 10.** Per-step distribution of (a) the compute imbalance ratio and (b) the maximum inter-machine link traffic over 10 steps.

**Figure 12.** (a) Per-step planning time versus stage time. “Rec.” denotes recompute and “Upd.” denotes policy update. (b) Per-layer expert transfer time versus attention time.

Original abstract

Mixture-of-Experts (MoE) and reinforcement learning (RL) post-training now dominate large language model (LLM) development, yet expert load imbalance remains a critical challenge. Existing load-balancing systems target pre-training by relying on historical step-level statistics. However, these methods fail under the unique workload dynamics of RL post-training: the step-level load is stable, but the tiny batch sizes processed during micro-steps cause severe, high-frequency load fluctuations. We introduce ForeMoE, a micro-step-level load balancing system for MoE RL post-training. Instead of relying on historical statistics, ForeMoE exploits the multi-stage RL pipeline (rollout, recompute, policy update) by using foreseeable routing information from the rollout stage to proactively guide load balancing in the remaining stages. To support frequent per-micro-step reconfiguration, ForeMoE employs a hierarchical planner that decomposes the NP-hard load balancing problem into tractable sub-components, alongside a transfer engine that leverages complementary hardware paths (CPU-assisted and GPU-direct) for overlapped expert transfer. Evaluations on 64 GPUs demonstrate that ForeMoE achieves up to a 1.45$\times$ speedup over state-of-the-art RL post-training systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ForeMoE's rollout-foresight approach for micro-step MoE balancing in RL post-training targets a real workload gap, but the 1.45x speedup rests on an unverified stability assumption with no supporting numbers in the provided abstract.

The paper introduces ForeMoE to address expert load imbalance in MoE RL post-training. It exploits the multi-stage pipeline by taking routing decisions from rollout and using them to guide placement and transfers in recompute and policy update micro-steps, rather than depending on historical step-level stats.

This is a reasonable distinction. RL post-training does have tiny batches and high-frequency fluctuations that pre-training balancers miss, and the hierarchical planner plus dual-path transfer engine (CPU-assisted and GPU-direct) are practical responses to the need for frequent reconfigurations.

The evaluation claims up to 1.45x speedup over prior RL systems on 64 GPUs. If the full experiments include solid baselines, variance numbers, and controls, that would be a useful data point for people running large-scale training.

The soft spot is the missing evidence on the central assumption. The abstract gives no routing overlap stats, load prediction error, or ablation that isolates the foresight component. If routing assignments shift between rollout and later stages because of batch or policy differences, the planner could misplace experts or add transfer overhead, erasing the gains. That concern from the stress-test note lands directly on the claim.

The work is aimed at systems engineers building distributed MoE training stacks, especially those handling RL fine-tuning at scale. Readers in that group would find the design choices worth examining.

It deserves peer review because the problem is timely and the solution is concrete, even though the paper will need stronger empirical backing on prediction stability before the speedup can be taken at face value.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces ForeMoE, a micro-step-level load balancing system for MoE models during RL post-training. It exploits the multi-stage pipeline (rollout, recompute, policy update) by using routing decisions from the rollout stage to proactively configure expert placement and transfers in later stages, employing a hierarchical planner to decompose the NP-hard balancing problem and a transfer engine that overlaps CPU-assisted and GPU-direct paths. On 64 GPUs it reports up to 1.45× speedup over existing RL post-training systems.

Significance. If the routing-foresight assumption holds and the reported speedup is reproducible, the work would provide a concrete systems technique for handling high-frequency load imbalance that is characteristic of RL post-training but not pre-training, potentially improving training throughput for large MoE models without requiring changes to the RL algorithm itself.

major comments (2)

[Abstract] Abstract: the 1.45× speedup is stated without any description of the baselines, workloads, number of runs, or variance; this information is required to assess whether the result supports the central claim.
[Abstract] Abstract: no routing-overlap statistics, load-prediction error, or ablation that isolates the foresight component are supplied, leaving the key assumption—that rollout-stage assignments remain sufficiently stable for recompute and update micro-steps—unsupported despite being load-bearing for the performance argument under the described tiny-batch, high-frequency regime.

minor comments (1)

[Abstract] The abstract introduces the terms 'hierarchical planner' and 'transfer engine' without a one-sentence gloss or pointer to the section that defines them.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and indicate planned revisions to the abstract and manuscript.

Referee: [Abstract] Abstract: the 1.45× speedup is stated without any description of the baselines, workloads, number of runs, or variance; this information is required to assess whether the result supports the central claim.

Authors: We agree that the abstract would benefit from additional context. In the revision we will update the abstract to briefly identify the baselines as state-of-the-art RL post-training systems that rely on historical step-level statistics, note the workloads as tiny-batch MoE RL post-training on 64 GPUs, and state that the 1.45× figure is the maximum observed speedup with full run counts and variance reported in the evaluation section. revision: yes

Referee: [Abstract] Abstract: no routing-overlap statistics, load-prediction error, or ablation that isolates the foresight component are supplied, leaving the key assumption—that rollout-stage assignments remain sufficiently stable for recompute and update micro-steps—unsupported despite being load-bearing for the performance argument under the described tiny-batch, high-frequency regime.

Authors: We acknowledge the abstract itself does not contain these supporting figures. The manuscript's evaluation section demonstrates the end-to-end benefit of rollout-stage foresight via the reported speedups. We will revise the abstract to reference the observed stability of routing decisions across pipeline stages and will ensure the body includes explicit routing-overlap statistics, load-prediction error measurements, and an ablation isolating the foresight component. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical systems claim with no self-referential derivation or fitting

The paper presents ForeMoE as a systems artifact that exploits rollout-stage routing to guide later pipeline stages, evaluated empirically on 64 GPUs for a 1.45× speedup. No equations, fitted parameters, or mathematical derivations are described in the abstract or claimed chain; the central result is an end-to-end performance measurement rather than a prediction derived from its own inputs. The stability assumption noted by the skeptic is an empirical precondition, not a self-definition or fitted-input reduction. No self-citations are invoked as load-bearing uniqueness theorems. This is a standard non-circular empirical systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

This is an applied systems paper; the central claim rests on the introduction of two new software components (hierarchical planner and transfer engine) rather than on mathematical free parameters or axioms.

invented entities (2)

hierarchical planner (no independent evidence)
purpose: decompose the NP-hard load balancing problem into tractable sub-components
New component introduced to support frequent per-micro-step reconfiguration

transfer engine (no independent evidence)
purpose: leverage complementary hardware paths (CPU-assisted and GPU-direct) for overlapped expert transfer
New component introduced to support frequent per-micro-step reconfiguration

pith-pipeline@v0.9.1-grok · 5771 in / 1174 out tokens · 26982 ms · 2026-06-27T08:33:26.506323+00:00 · methodology

Reference graph

Works this paper leans on

76 extracted references · 13 canonical work pages

[1]

DeepEP: A high-performance communication library, 2025. https://github.com/deepseek-ai/DeepEP

2025
[2]

Expert parallelism load balancer, 2025. https://github.com/deepseek-ai/EPLB

2025
[3]

A Survey on Mixture of Experts in Large Language Models

Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. A survey on mixture of experts in large language models. IEEE Transactions on Knowledge and Data Engineering, 2025. http://dx.doi.org/10.1109/TKDE.2025.3554028

work page doi:10.1109/tkde.2025.3554028 2025
[4]

Respec: Towards optimizing speculative decoding in reinforcement learning systems, 2025.https://arxiv.org/abs/2510.26475

Qiaoling Chen, Zijun Liu, Peng Sun, Shenggui Li, Guoteng Wang, Ziming Liu, Yonggang Wen, Siyuan Feng, and Tianwei Zhang. Respec: Towards optimizing speculative decoding in reinforcement learning systems, 2025.https://arxiv.org/abs/2510.26475

arXiv 2025
[5]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. https://arxiv.org/abs/2501.12948

DeepSeek-AI, Daya Guo, Dejian Yang, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. https://arxiv.org/abs/2501.12948

Pith/arXiv arXiv 2025
[6]

Deepseek-v3 technical report, 2025.https://arxiv.org/abs/2412.19437

DeepSeek-AI, Aixin Liu, Bei Feng, et al. Deepseek-v3 technical report, 2025.https://arxiv.org/abs/2412.19437

Pith/arXiv arXiv 2025
[7]

GLaM: Efficient scaling of language models with mixture-of-experts

Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P Bosma, Zongwei Zhou, Tao Wang, Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc Le, Yonghui Wu, Zhifeng Chen, and...

2022
[8]

Dapo-math-17k dataset, 2025. https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k

Hugging Face. Dapo-math-17k dataset, 2025. https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k

2025
[9]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 2022. http://jmlr.org/papers/v23/21-0998.html

2022
[10]

Areal: A large-scale asynchronous reinforcement learning system for language reasoning, 2025. https://arxiv.org/abs/2505.24298

Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, and Yi Wu. Areal: A large-scale asynchronous reinforcement learning system for language reasoning, 2025. https://arxiv.org/abs/2505.24298

Pith/arXiv arXiv 2025
[11]

R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM J. Appl. Math., 1969. https://doi.org/10.1137/0117039

work page doi:10.1137/0117039 1969
[12]

Asyncflow: An asynchronous streaming rl framework for efficient llm post-training, 2025.https://arxiv.org/abs/2507.01663

Zhenyu Han, Ansheng You, Haibo Wang, Kui Luo, Guang Yang, Wenqi Shi, Menglong Chen, Sicheng Zhang, Zeshun Lan, Chunshi Deng, Huazhong Ji, Wenjie Liu, Yu Huang, Yixiang Zhang, Chenyi Pan, Jing Wang, Xin Huang, Chunsheng Li, and Jianping Wu. Asyncflow: An asynchronous streaming rl framework for efficient llm post-training, 2025.https://arxiv.org/abs/2507.01663

arXiv 2025
[13]

Fastermoe: modeling and optimizing training of large-scale dynamic pre-trained models

Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, and Qin Li. Fastermoe: modeling and optimizing training of large-scale dynamic pre-trained models. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2022. https://doi.org/10.1145/3503221.3508418

work page doi:10.1145/3503221.3508418 2022
[14]

History rhymes: Accelerating llm reinforcement learning with rhymerl, 2025.https://arxiv.org/abs/2508.18588

Jingkai He, Tianjian Li, Erhu Feng, Dong Du, Qian Liu, Tao Liu, Yubin Xia, and Haibo Chen. History rhymes: Accelerating llm reinforcement learning with rhymerl, 2025.https://arxiv.org/abs/2508.18588

arXiv 2025
[15]

Openrlhf: An easy-to-use, scalable and high-performance rlhf framework, 2025

Jian Hu, Xibin Wu, Wei Shen, Jason Klein Liu, Zilin Zhu, Weixun Wang, Songlin Jiang, Haoran Wang, Hao Chen, Bin Chen, Weikai Fang, Xianyu, Yu Cao, Haotian Xu, and Yiming Liu. Openrlhf: An easy-to-use, scalable and high-performance rlhf framework, 2025. https://arxiv.org/abs/2405.11143

Pith/arXiv arXiv 2025
[16]

Demystifying nccl: An in-depth analysis of gpu communication protocols and algorithms, 2025.https://arxiv.org/abs/2507.04786

Zhiyi Hu, Siyuan Shen, Tommaso Bonato, Sylvain Jeaugey, Cedell Alexander, Eric Spada, James Dinan, Jeff Hammond, and Torsten Hoe- fler. Demystifying nccl: An in-depth analysis of gpu communication protocols and algorithms, 2025.https://arxiv.org/abs/2507.04786

arXiv 2025
[17]

Qerl: Beyond efficiency – quantization-enhanced reinforcement learning for llms, 2025. https://arxiv.org/abs/2510.11696

Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xiaojuan Qi, Song Han, and Yukang Chen. Qerl: Beyond efficiency – quantization-enhanced reinforcement learning for llms, 2025. https://arxiv.org/abs/2510.11696

arXiv 2025
[18]

Parallelizing the dual revised simplex method

Q. Huangfu and J. A. J. Hall. Parallelizing the dual revised simplex method. Mathematical Programming Computation, 2018. https://doi.org/10.1007/s12532-017-0130-5

work page doi:10.1007/s12532-017-0130-5 2018
[19]

Tutel: Adaptive mixture-of-experts at scale

Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, HoYuen Chau, Peng Cheng, Fan Yang, Mao Yang, and Yongqiang Xiong. Tutel: Adaptive mixture-of-experts at scale. In Proceedings of Machine Learning and Systems, 2023. https://proceedings.mlsys.org/paper_files/paper/2023/file/5616d34cf8ff739...

2023
[20]

Coderl+: Improving code generation via reinforcement with execution semantics alignment, 2026

Xue Jiang, Yihong Dong, Mengyang Liu, Hongyi Deng, Tian Wang, Yongding Tao, Rongyu Cao, Binhua Li, Zhi Jin, Wenpin Jiao, Fei Huang, Yongbin Li, and Ge Li. Coderl+: Improving code generation via reinforcement with execution semantics alignment, 2026. https://arxiv.org/abs/2510.18471

Pith/arXiv arXiv 2026
[21]

Relibra: Routing-replay-guided load balancing for moe training in reinforcement learning, 2026. https://arxiv.org/abs/2605.08639

Chao Jin, Xinming Wei, Yinmin Zhong, Chengxu Yang, Bingyang Wu, Ruidong Zhu, Zili Zhang, Yuliang Liu, and Xin Jin. Relibra: Routing-replay-guided load balancing for moe training in reinforcement learning, 2026. https://arxiv.org/abs/2605.08639

Pith/arXiv arXiv 2026
[22]

Towards democratizing LLMs: Investigating multilingual mixture-of-experts models

Aditi Khandelwal, Marius Mosbach, Verna Dankers, Siva Reddy, and Golnoosh Farnadi. Towards democratizing LLMs: Investigating multilingual mixture-of-experts models. In Women in Machine Learning Workshop @ NeurIPS 2025, 2026. https://openreview.net/forum?id=Bwf4grCk3H

2025
[23]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, 2023. https://doi.org/10.1145/3600006.3613165

work page doi:10.1145/3600006.3613165 2023
[24]

GShard: Scaling giant models with conditional computation and automatic sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations, 2021. https://openreview.net/forum?id=qrwe7XHTmYb

2021
[25]

Base layers: Simplifying training of large, sparse models

Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. Base layers: Simplifying training of large, sparse models. In Proceedings of the 38th International Conference on Machine Learning, 2021. https://proceedings.mlr.press/v139/lewis21a.html

2021
[26]

Unleashing efficient asynchronous rl post-training via staleness-constrained rollout coordination, 2026. https://arxiv.org/abs/2601.12784

Haoyang Li, Sheng Lin, Fangcheng Fu, Yuming Zhou, Xiaodong Ji, Yanfeng Zhao, Lefeng Wang, Jie Jiang, and Bin Cui. Unleashing efficient asynchronous rl post-training via staleness-constrained rollout coordination, 2026. https://arxiv.org/abs/2601.12784

arXiv 2026
[27]

Spec-rl: Accelerating on-policy reinforcement learning with speculative rollouts, 2026. https://arxiv.org/abs/2509.23232

Bingshuai Liu, Ante Wang, Zijun Min, Liang Yao, Haibo Zhang, Yang Liu, Xu Han, Peng Li, Anxiang Zeng, and Jinsong Su. Spec-rl: Accelerating on-policy reinforcement learning with speculative rollouts, 2026. https://arxiv.org/abs/2509.23232

arXiv 2026
[28]

Jiacai Liu, Yingru Li, Yuqian Fu, Jiawei Wang, Qian Liu, and Yu Shen. When speed kills stability: Demystifying rl collapse from the inference-training mismatch, 2025. https://yingru.notion.site/When-Speed-Kills-Stability-Demystifying-RL-Collapse-from-the-Inference-Training-Mismatch-271211a558b7808d8b12d403fd15edda

2025
[29]

Flashrl: 8bit rollouts, full power rl, 2025. https://fengyao.notion.site/flash-rl

Liyuan Liu, Feng Yao, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, and Jianfeng Gao. Flashrl: 8bit rollouts, full power rl, 2025. https://fengyao.notion.site/flash-rl

2025
[30]

Laer-moe: Load-adaptive expert re-layout for efficient mixture-of-experts training

Xinyi Liu, Yujie Wang, Fangcheng Fu, Xuefeng Xiao, Huixia Li, Jiashi Li, and Bin Cui. Laer-moe: Load-adaptive expert re-layout for efficient mixture-of-experts training. In Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2026. https://doi.org/10.1145/3779212.3790180

arXiv 2026
[31]

Part ii: Roll flash – accelerating rlvr and agentic training with asynchrony, 2025.https://arxiv.org/abs/2510.11345

Han Lu, Zichen Liu, Shaopan Xiong, Yancheng He, Wei Gao, Yanan Wu, Weixun Wang, Jiashun Liu, Yang Li, Haizhou Zhao, Ju Huang, Siran Yang, Xiaoyang Li, Yijia Luo, Zihe Liu, Ling Pan, Junchi Yan, Wei Wang, Wenbo Su, Jiamang Wang, Lin Qu, and Bo Zheng. Part ii: Roll flash – accelerating rlvr and agentic training with asynchrony, 2025.https://arxiv.org/abs/2510.11345

arXiv 2025
[32]

Stabilizing moe reinforcement learning by aligning training and inference routers, 2025. https://arxiv.org/abs/2510.11370

Wenhan Ma, Hailin Zhang, Liang Zhao, Yifan Song, Yudong Wang, Zhifang Sui, and Fuli Luo. Stabilizing moe reinforcement learning by aligning training and inference routers, 2025. https://arxiv.org/abs/2510.11370

arXiv 2025
[33]

Math-beyond: A benchmark for rl to expand beyond the base model, 2025.https://arxiv.org/abs/2510.11653

Prasanna Mayilvahanan, Ricardo Dominguez-Olmedo, Thaddäus Wiedemer, and Wieland Brendel. Math-beyond: A benchmark for rl to expand beyond the base model, 2025.https://arxiv.org/abs/2510.11653

arXiv 2025
[34]

Ray: a distributed framework for emerging ai applications

Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. Ray: a distributed framework for emerging ai applications. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation, 2018. https://dl.acm.org/doi/10.5555...

work page doi:10.5555/3291168.3291210 2018
[35]

A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications, 2025. https://arxiv.org/abs/2503.07137

Siyuan Mu and Sen Lin. A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications, 2025. https://arxiv.org/abs/2503.07137

arXiv 2025
[36]

Pipedream: generalized pipeline parallelism for dnn training

Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. Pipedream: generalized pipeline parallelism for dnn training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, 2019. https://doi.org/10.1145/3341301.3359646

work page doi:10.1145/3341301.3359646 2019
[37]

Memory-efficient pipeline-parallel dnn training

Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. Memory-efficient pipeline-parallel dnn training. In Proceedings of the 38th International Conference on Machine Learning, 2021.https://proceedings.mlr.press/v139/narayanan21a.html

2021
[38]

Efficient large-scale language model training on gpu clusters using megatron-lm

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Perfo...

work page doi:10.1145/3458817.3476209 2021
[39]

Flexmoe: Scaling large-scale sparse pre-trained model training via dynamic device placement

Xiaonan Nie, Xupeng Miao, Zilong Wang, Zichao Yang, Jilong Xue, Lingxiao Ma, Gang Cao, and Bin Cui. Flexmoe: Scaling large-scale sparse pre-trained model training via dynamic device placement. Proc. ACM Manag. Data, 2023. https://doi.org/10.1145/3588964

work page doi:10.1145/3588964 2023
[40]

Asynchronous rlhf: Faster and more efficient off-policy rl for language models, 2025

Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, and Aaron Courville. Asynchronous rlhf: Faster and more efficient off-policy rl for language models, 2025. https://arxiv.org/abs/2410.18252

arXiv 2025
[41]

Nvidia collective communication library (nccl) documentation, 2025. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html

NVIDIA. Nvidia collective communication library (nccl) documentation, 2025. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html

2025
[42]

Codeforces, 2025. https://huggingface.co/datasets/open-r1/codeforces

Guilherme Penedo, Anton Lozhkov, Hynek Kydlíček, Loubna Ben Allal, Edward Beeching, Agustín Piqueres Lajarín, Quentin Gallouédec, Nathan Habib, Lewis Tunstall, and Leandro von Werra. Codeforces, 2025. https://huggingface.co/datasets/open-r1/codeforces

2025
[43]

Seer: Online context learning for fast synchronous llm reinforcement learning, 2025. https://arxiv.org/abs/2511.14617

Ruoyu Qin, Weiran He, Weixiao Huang, Yangkun Zhang, Yikai Zhao, Bo Pang, Xinran Xu, Yingdi Shan, Yongwei Wu, and Mingxing Zhang. Seer: Online context learning for fast synchronous llm reinforcement learning, 2025. https://arxiv.org/abs/2511.14617

Pith/arXiv arXiv 2025
[44]

Qwen2.5 technical report, 2025. https://arxiv.org/abs/2412.15115

Qwen, An Yang, Baosong Yang, Beichen Zhang, et al. Qwen2.5 technical report, 2025. https://arxiv.org/abs/2412.15115

Pith/arXiv arXiv 2025
[45]

Convex Analysis

R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970. http://www.jstor.org/stable/j.ctt14bs1ff

1970
[46]

Hash layers for large sparse models

Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, and Jason Weston. Hash layers for large sparse models. In Advances in Neural Information Processing Systems, 2021. https://proceedings.neurips.cc/paper_files/paper/2021/file/92bf5e6240737e0326ea59846a83e076-Paper.pdf

2021
[47]

Proximal policy optimization algorithms, 2017. https://arxiv.org/abs/1707.06347

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. https://arxiv.org/abs/1707.06347

Pith/arXiv arXiv 2017
[48]

Beat the long tail: Distribution-aware speculative decoding for rl training, 2025. https://arxiv.org/abs/2511.13841

Zelei Shao, Vikranth Srivatsa, Sanjana Srivastava, Qingyang Wu, Alpay Ariyak, Xiaoxia Wu, Ameen Patel, Jue Wang, Percy Liang, Tri Dao, Ce Zhang, Yiying Zhang, Ben Athiwaratkun, Chenfeng Xu, and Junxiong Wang. Beat the long tail: Distribution-aware speculative decoding for rl training, 2025. https://arxiv.org/abs/2511.13841

arXiv 2025
[49]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. https://arxiv.org/abs/2402.03300

Pith/arXiv arXiv 2024
[50]

Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, 2017. https://arxiv.org/abs/1701.06538

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, 2017. https://arxiv.org/abs/1701.06538

Pith/arXiv arXiv 2017
[51]

Laminar: A scalable asynchronous rl post-training framework, 2025. https://arxiv.org/abs/2510.12633

Guangming Sheng, Yuxuan Tong, Borui Wan, Wang Zhang, Chaobo Jia, Xibin Wu, Yuqi Wu, Xiang Li, Chi Zhang, Yanghua Peng, Haibin Lin, Xin Liu, and Chuan Wu. Laminar: A scalable asynchronous rl post-training framework, 2025. https://arxiv.org/abs/2510.12633

arXiv 2025
[52]

Hybridflow: A flexible and efficient rlhf framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, 2025. http://dx.doi.org/10.1145/3689031.3696075

arXiv 2025
[53]

Pipemoe: Accelerating mixture-of-experts through adaptive pipelining

Shaohuai Shi, Xinglin Pan, Xiaowen Chu, and Bo Li. Pipemoe: Accelerating mixture-of-experts through adaptive pipelining. In IEEE INFOCOM 2023 - IEEE Conference on Computer Communications, 2023. https://doi.org/10.1109/INFOCOM53939.2023.10228874

work page doi:10.1109/infocom53939.2023.10228874 2023
[54]

Schemoe: An extensible mixture-of-experts distributed training system with tasks scheduling

Shaohuai Shi, Xinglin Pan, Qiang Wang, Chengjian Liu, Xiaozhe Ren, Zhongzhe Hu, Yu Yang, Bo Li, and Xiaowen Chu. Schemoe: An extensible mixture-of-experts distributed training system with tasks scheduling. In Proceedings of the Nineteenth European Conference on Computer Systems, 2024. https://doi.org/10.1145/3627703.3650083

work page doi:10.1145/3627703.3650083 2024
[55]

Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020. https://arxiv.org/abs/1909.08053

Pith/arXiv arXiv 2020
[56]

SYMI: Efficient Mixture-of-Experts training via model and optimizer state decoupling

Athinagoras Skiadopoulos, Mark Zhao, Swapnil Gandhi, Thomas Norrie, Shrijeet Mukherjee, and Christos Kozyrakis. SYMI: Efficient Mixture-of-Experts training via model and optimizer state decoupling. In 23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI 26), 2026. https://www.usenix.org/conference/nsdi26/presentation/skiadopoulos

2026
[57]

Kimi k2.5: Visual agentic intelligence, 2026. https://arxiv.org/abs/2602.02276

Kimi Team, Tongtong Bai, Yifan Bai, et al. Kimi k2.5: Visual agentic intelligence, 2026. https://arxiv.org/abs/2602.02276

Pith/arXiv arXiv 2026
[58]

Kimi k2: Open agentic intelligence, 2025. https://arxiv.org/abs/2507.20534

Kimi Team, Yifan Bai, Yiping Bao, et al. Kimi k2: Open agentic intelligence, 2025. https://arxiv.org/abs/2507.20534

Pith/arXiv arXiv 2025
[59]

Philippe Tillet, H. T. Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, 2019. https://doi.org/10.1145/3315508.3329973

arXiv 2019
[60]

A survey on large language models for mathematical reasoning

Peng-Yuan Wang, Tian-Shuo Liu, Chenyang Wang, Ziniu Li, Yidi Wang, Shu Yan, Chengxing Jia, Xu-Hui Liu, Xinwei Chen, Jiacheng Xu, and Yang Yu. A survey on large language models for mathematical reasoning. ACM Comput. Surv., 2026. https://dl.acm.org/doi/10.1145/3786333

2026
[61]

The myth of expert specialization in moes: Why routing reflects geometry, not necessarily domain expertise, 2026. https://arxiv.org/abs/2604.09780

Xi Wang, Soufiane Hayou, and Eric Nalisnick. The myth of expert specialization in moes: Why routing reflects geometry, not necessarily domain expertise, 2026. https://arxiv.org/abs/2604.09780

Pith/arXiv arXiv 2026
[62]

Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, and Sida I. Wang. Swe-rl: Advancing llm reasoning via reinforcement learning on open software evolution, 2025. https://arxiv.org/abs/2502.18449

2025
[63]

Llamarl: A distributed asynchronous reinforcement learning framework for efficient large-scale llm training, 2025. https://arxiv.org/abs/2505.24034

Bo Wu, Sid Wang, Yunhao Tang, Jia Ding, Eryk Helenowski, Liang Tan, Tengyu Xu, Tushar Gowda, Zhengxing Chen, Chen Zhu, Xiaocheng Tang, Yundi Qian, Beibei Zhu, and Rui Hou. Llamarl: A distributed asynchronous reinforcement learning framework for efficient large-scale llm training, 2025. https://arxiv.org/abs/2505.24034

arXiv 2025
[64]

Qwen3 technical report, 2025. https://arxiv.org/abs/2505.09388

An Yang, Anfeng Li, Baosong Yang, et al. Qwen3 technical report, 2025. https://arxiv.org/abs/2505.09388

Pith/arXiv arXiv 2025
[65]

Your efficient rl framework secretly brings you off-policy rl training, 2025. https://fengyao.notion.site/off-policy-rl

Feng Yao, Liyuan Liu, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, and Jianfeng Gao. Your efficient rl framework secretly brings you off-policy rl training, 2025. https://fengyao.notion.site/off-policy-rl

2025
[66]

Dapo: An open-source llm reinforcement learning system at scale, 2025. https://arxiv.org/abs/2503.14476

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

Pith/arXiv arXiv 2025
[67]

SmartMoE: Efficiently training Sparsely-Activated models through combining offline and online parallelization

Mingshu Zhai, Jiaao He, Zixuan Ma, Zan Zong, Runqing Zhang, and Jidong Zhai. SmartMoE: Efficiently training Sparsely-Activated models through combining offline and online parallelization. In 2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023. https://www.usenix.org/conference/atc23/presentation/zhai

2023
[68]

PopFetcher: Towards accelerated Mixture-of-Experts training via popularity based Expert-Wise prefetch

Junyi Zhang, Chuanhu Ma, Xiong Wang, Yuntao Nie, Yuqing Li, Yuedong Xu, Xiaofei Liao, Bo Li, and Hai Jin. PopFetcher: Towards accelerated Mixture-of-Experts training via popularity based Expert-Wise prefetch. In 2025 USENIX Annual Technical Conference (USENIX ATC 25), 2025. https://www.usenix.org/conference/atc25/presentation/zhang-junyi

2025
[69]

Comet: Fine-grained computation-communication overlapping for mixture-of-experts

Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, and Xin Liu. Comet: Fine-grained computation-communication overlapping for mixture-of-experts. In Proceedings of Machine Learning and Systems, 2025. https://proceedings.mlsys.org/paper_files/paper/2025/file/e27ea0...

2025
[70]

Fine-grained moe load balancing with linear programming, 2026. https://arxiv.org/abs/2511.16947

Chenqi Zhao, Wenfei Wu, Linhai Song, Yuchen Xu, and Yitao Yuan. Fine-grained moe load balancing with linear programming, 2026. https://arxiv.org/abs/2511.16947

arXiv 2026
[71]

Small leak can sink a great ship–boost rl training on moe with icepop!, 2025. https://ringtech.notion.site/icepop

Xin Zhao, Yongkang Liu, Kuan Xu, Jia Guo, Zihao Wang, Yan Sun, Xinyu Kong, Qianggang Cao, Liang Jiang, Zujie Wen, Zhiqiang Zhang, and Jun Zhou. Small leak can sink a great ship–boost rl training on moe with icepop!, 2025. https://ringtech.notion.site/icepop

2025
[72]

Pytorch fsdp: Experiences on scaling fully sharded data parallel

Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. Pytorch fsdp: Experiences on scaling fully sharded data parallel. Proc. VLDB Endow., 2023. https://doi.org/10.147...

work page doi:10.14778/3611540.3611569 2023
[73]

Stabilizing reinforcement learning with llms: Formulation and practices, 2025

Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang, Junrong Lin, Yuqiong Liu, Hao Lin, Chencan Wu, Feng Hu, An Yang, Jingren Zhou, and Junyang Lin. Stabilizing reinforcement learning with llms: Formulation and practices, 2025. https://arxiv.org/abs/2512.01374

2025
[74]

Sglang: efficient execution of structured language model programs

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. Sglang: efficient execution of structured language model programs. In Proceedings of the 38th International Conference on Neural Information Processing Systems, 2024. https://dl.acm.o...

work page doi:10.5555/3737916.3739916 2024
[75]

Streamrl: Scalable, heterogeneous, and elastic rl for llms with disaggregated stream generation, 2025. https://arxiv.org/abs/2504.15930

Yinmin Zhong, Zili Zhang, Xiaoniu Song, Hanpeng Hu, Chao Jin, Bingyang Wu, Nuo Chen, Yukun Chen, Yu Zhou, Changyi Wan, Hongyu Zhou, Yimin Jiang, Yibo Zhu, and Daxin Jiang. Streamrl: Scalable, heterogeneous, and elastic rl for llms with disaggregated stream generation, 2025. https://arxiv.org/abs/2504.15930

arXiv 2025
[76]

Mixture-of-experts with expert choice routing

Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Zhifeng Chen, Quoc V Le, and James Laudon. Mixture-of-experts with expert choice routing. In Advances in Neural Information Processing Systems, 2022. https://proceedings.neurips.cc/paper_files/paper/2022/file/2f00ecd787b432c1d36f3de9800728eb-Paper-Conference.pdf. A De...

2022

[1]

Deepep: A high-performance communication library, 2025. https://github.com/deepseek-ai/DeepEP

2025

[2]

Expert parallelism load balancer, 2025. https://github.com/deepseek-ai/EPLB

2025

[3]

A survey on mixture of experts in large language models

Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. A survey on mixture of experts in large language models. IEEE Transactions on Knowledge and Data Engineering, 2025. http://dx.doi.org/10.1109/TKDE.2025.3554028

work page doi:10.1109/tkde.2025.3554028 2025

[4]

Respec: Towards optimizing speculative decoding in reinforcement learning systems, 2025. https://arxiv.org/abs/2510.26475

Qiaoling Chen, Zijun Liu, Peng Sun, Shenggui Li, Guoteng Wang, Ziming Liu, Yonggang Wen, Siyuan Feng, and Tianwei Zhang. Respec: Towards optimizing speculative decoding in reinforcement learning systems, 2025. https://arxiv.org/abs/2510.26475

arXiv 2025

[5]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. https://arxiv.org/abs/2501.12948

DeepSeek-AI, Daya Guo, Dejian Yang, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. https://arxiv.org/abs/2501.12948

Pith/arXiv arXiv 2025

[6]

Deepseek-v3 technical report, 2025. https://arxiv.org/abs/2412.19437

DeepSeek-AI, Aixin Liu, Bei Feng, et al. Deepseek-v3 technical report, 2025. https://arxiv.org/abs/2412.19437

Pith/arXiv arXiv 2025

[7]

GLaM: Efficient scaling of language models with mixture-of-experts

Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P Bosma, Zongwei Zhou, Tao Wang, Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc Le, Yonghui Wu, Zhifeng Chen, and...

2022

[8]

Dapo-math-17k dataset, 2025. https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k

Hugging Face. Dapo-math-17k dataset, 2025. https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k

2025

[9]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 2022. http://jmlr.org/papers/v23/21-0998.html

2022

[10]

Areal: A large-scale asynchronous reinforcement learning system for language reasoning, 2025. https://arxiv.org/abs/2505.24298

Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, and Yi Wu. Areal: A large-scale asynchronous reinforcement learning system for language reasoning, 2025. https://arxiv.org/abs/2505.24298

Pith/arXiv arXiv 2025

[11]

R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM J. Appl. Math., 1969. https://doi.org/10.1137/0117039

work page doi:10.1137/0117039 1969

[12]

Asyncflow: An asynchronous streaming rl framework for efficient llm post-training, 2025. https://arxiv.org/abs/2507.01663

Zhenyu Han, Ansheng You, Haibo Wang, Kui Luo, Guang Yang, Wenqi Shi, Menglong Chen, Sicheng Zhang, Zeshun Lan, Chunshi Deng, Huazhong Ji, Wenjie Liu, Yu Huang, Yixiang Zhang, Chenyi Pan, Jing Wang, Xin Huang, Chunsheng Li, and Jianping Wu. Asyncflow: An asynchronous streaming rl framework for efficient llm post-training, 2025. https://arxiv.org/abs/2507.01663

arXiv 2025

[13]

Fastermoe: modeling and optimizing training of large-scale dynamic pre-trained models

Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, and Qin Li. Fastermoe: modeling and optimizing training of large-scale dynamic pre-trained models. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2022. https://doi.org/10.1145/3503221.3508418

work page doi:10.1145/3503221.3508418 2022

[14]

History rhymes: Accelerating llm reinforcement learning with rhymerl, 2025. https://arxiv.org/abs/2508.18588

Jingkai He, Tianjian Li, Erhu Feng, Dong Du, Qian Liu, Tao Liu, Yubin Xia, and Haibo Chen. History rhymes: Accelerating llm reinforcement learning with rhymerl, 2025. https://arxiv.org/abs/2508.18588

arXiv 2025

[15]

Openrlhf: An easy-to-use, scalable and high-performance rlhf framework, 2025

Jian Hu, Xibin Wu, Wei Shen, Jason Klein Liu, Zilin Zhu, Weixun Wang, Songlin Jiang, Haoran Wang, Hao Chen, Bin Chen, Weikai Fang, Xianyu, Yu Cao, Haotian Xu, and Yiming Liu. Openrlhf: An easy-to-use, scalable and high-performance rlhf framework, 2025. https://arxiv.org/abs/2405.11143

Pith/arXiv arXiv 2025

[16]

Demystifying nccl: An in-depth analysis of gpu communication protocols and algorithms, 2025. https://arxiv.org/abs/2507.04786

Zhiyi Hu, Siyuan Shen, Tommaso Bonato, Sylvain Jeaugey, Cedell Alexander, Eric Spada, James Dinan, Jeff Hammond, and Torsten Hoefler. Demystifying nccl: An in-depth analysis of gpu communication protocols and algorithms, 2025. https://arxiv.org/abs/2507.04786

arXiv 2025

[17]

Qerl: Beyond efficiency – quantization-enhanced reinforcement learning for llms, 2025. https://arxiv.org/abs/2510.11696

Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xiaojuan Qi, Song Han, and Yukang Chen. Qerl: Beyond efficiency – quantization-enhanced reinforcement learning for llms, 2025. https://arxiv.org/abs/2510.11696

arXiv 2025

[18]

Parallelizing the dual revised simplex method

Q. Huangfu and J. A. J. Hall. Parallelizing the dual revised simplex method. Mathematical Programming Computation, 2018. https://doi.org/10.1007/s12532-017-0130-5

work page doi:10.1007/s12532-017-0130-5 2018

[19]

Tutel: Adaptive mixture-of-experts at scale

Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, HoYuen Chau, Peng Cheng, Fan Yang, Mao Yang, and Yongqiang Xiong. Tutel: Adaptive mixture-of-experts at scale. In Proceedings of Machine Learning and Systems, 2023. https://proceedings.mlsys.org/paper_files/paper/2023/file/5616d34cf8ff739...

2023

[20]

Coderl+: Improving code generation via reinforcement with execution semantics alignment, 2026

Xue Jiang, Yihong Dong, Mengyang Liu, Hongyi Deng, Tian Wang, Yongding Tao, Rongyu Cao, Binhua Li, Zhi Jin, Wenpin Jiao, Fei Huang, Yongbin Li, and Ge Li. Coderl+: Improving code generation via reinforcement with execution semantics alignment, 2026. https://arxiv.org/abs/2510.18471

Pith/arXiv arXiv 2026

[21]

Relibra: Routing-replay-guided load balancing for moe training in reinforcement learning, 2026. https://arxiv.org/abs/2605.08639

Chao Jin, Xinming Wei, Yinmin Zhong, Chengxu Yang, Bingyang Wu, Ruidong Zhu, Zili Zhang, Yuliang Liu, and Xin Jin. Relibra: Routing-replay-guided load balancing for moe training in reinforcement learning, 2026. https://arxiv.org/abs/2605.08639

Pith/arXiv arXiv 2026

[22]

Towards democratizing LLMs: Investigating multilingual mixture-of-experts models

Aditi Khandelwal, Marius Mosbach, Verna Dankers, Siva Reddy, and Golnoosh Farnadi. Towards democratizing LLMs: Investigating multilingual mixture-of-experts models. In Women in Machine Learning Workshop @ NeurIPS 2025, 2026. https://openreview.net/forum?id=Bwf4grCk3H

2025

[23]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, 2023. https://doi.org/10.1145/3600006.3613165

work page doi:10.1145/3600006.3613165 2023

[24]

GShard: Scaling giant models with conditional computation and automatic sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations, 2021. https://openreview.net/forum?id=qrwe7XHTmYb

2021

[25]

Base layers: Simplifying training of large, sparse models

Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. Base layers: Simplifying training of large, sparse models. In Proceedings of the 38th International Conference on Machine Learning, 2021. https://proceedings.mlr.press/v139/lewis21a.html

2021

[26]

Unleashing efficient asynchronous rl post-training via staleness-constrained rollout coordination, 2026. https://arxiv.org/abs/2601.12784

Haoyang Li, Sheng Lin, Fangcheng Fu, Yuming Zhou, Xiaodong Ji, Yanfeng Zhao, Lefeng Wang, Jie Jiang, and Bin Cui. Unleashing efficient asynchronous rl post-training via staleness-constrained rollout coordination, 2026. https://arxiv.org/abs/2601.12784

arXiv 2026

[27]

Spec-rl: Accelerating on-policy reinforcement learning with speculative rollouts, 2026. https://arxiv.org/abs/2509.23232

Bingshuai Liu, Ante Wang, Zijun Min, Liang Yao, Haibo Zhang, Yang Liu, Xu Han, Peng Li, Anxiang Zeng, and Jinsong Su. Spec-rl: Accelerating on-policy reinforcement learning with speculative rollouts, 2026. https://arxiv.org/abs/2509.23232

arXiv 2026

[28]

Jiacai Liu, Yingru Li, Yuqian Fu, Jiawei Wang, Qian Liu, and Yu Shen. When speed kills stability: Demystifying rl collapse from the inference-training mismatch, 2025. https://yingru.notion.site/When-Speed-Kills-Stability-Demystifying-RL-Collapse-from-the-Inference-Training-Mismatch-271211a558b7808d8b12d403fd15edda

2025

[29]

Flashrl: 8bit rollouts, full power rl, 2025. https://fengyao.notion.site/flash-rl

Liyuan Liu, Feng Yao, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, and Jianfeng Gao. Flashrl: 8bit rollouts, full power rl, 2025. https://fengyao.notion.site/flash-rl

2025

[30]

Laer-moe: Load-adaptive expert re-layout for efficient mixture-of-experts training

Xinyi Liu, Yujie Wang, Fangcheng Fu, Xuefeng Xiao, Huixia Li, Jiashi Li, and Bin Cui. Laer-moe: Load-adaptive expert re-layout for efficient mixture-of-experts training. In Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2026. https://doi.org/10.1145/3779212.3790180

arXiv 2026

[31]

Part ii: Roll flash – accelerating rlvr and agentic training with asynchrony, 2025. https://arxiv.org/abs/2510.11345

Han Lu, Zichen Liu, Shaopan Xiong, Yancheng He, Wei Gao, Yanan Wu, Weixun Wang, Jiashun Liu, Yang Li, Haizhou Zhao, Ju Huang, Siran Yang, Xiaoyang Li, Yijia Luo, Zihe Liu, Ling Pan, Junchi Yan, Wei Wang, Wenbo Su, Jiamang Wang, Lin Qu, and Bo Zheng. Part ii: Roll flash – accelerating rlvr and agentic training with asynchrony, 2025. https://arxiv.org/abs/2510.11345

arXiv 2025

[32]

Stabilizing moe reinforcement learning by aligning training and inference routers, 2025. https://arxiv.org/abs/2510.11370

Wenhan Ma, Hailin Zhang, Liang Zhao, Yifan Song, Yudong Wang, Zhifang Sui, and Fuli Luo. Stabilizing moe reinforcement learning by aligning training and inference routers, 2025. https://arxiv.org/abs/2510.11370

arXiv 2025

[33]

Math-beyond: A benchmark for rl to expand beyond the base model, 2025. https://arxiv.org/abs/2510.11653

Prasanna Mayilvahanan, Ricardo Dominguez-Olmedo, Thaddäus Wiedemer, and Wieland Brendel. Math-beyond: A benchmark for rl to expand beyond the base model, 2025. https://arxiv.org/abs/2510.11653

arXiv 2025

[34]

Ray: a distributed framework for emerging ai applications

Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. Ray: a distributed framework for emerging ai applications. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation, 2018. https://dl.acm.org/doi/10.5555...

work page doi:10.5555/3291168.3291210 2018

[35] [35]

A comprehensive survey of mixture-of- experts: Algorithms, theory, and applications, 2025.https://arxiv.org/ abs/2503.07137

Siyuan Mu and Sen Lin. A comprehensive survey of mixture-of- experts: Algorithms, theory, and applications, 2025.https://arxiv.org/ abs/2503.07137

arXiv 2025

[36] [36]

Pipedream: generalized pipeline parallelism for dnn training,

Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. Pipedream: generalized pipeline parallelism for dnn train- ing. InProceedings of the 27th ACM Symposium on Operating Systems Principles, 2019.https://doi.org/10.1145/3341301.3359646

work page doi:10.1145/3341301.3359646 2019

[37] [37]

Memory-efficient pipeline-parallel dnn training

Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. Memory-efficient pipeline-parallel dnn training. In Proceedings of the 38th International Conference on Machine Learning, 2021.https://proceedings.mlr.press/v139/narayanan21a.html

2021

[38] [38]

Efficient large-scale language model training on gpu clusters using megatron-lm,

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGres- ley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on gpu clusters using megatron-lm. InProceedings of the International Conference for High Perfo...

work page doi:10.1145/3458817.3476209 2021

[39] [39]

Flexmoe: Scaling large-scale sparse pre-trained model training via dynamic device placement.Proc

Xiaonan Nie, Xupeng Miao, Zilong Wang, Zichao Yang, Jilong Xue, Lingxiao Ma, Gang Cao, and Bin Cui. Flexmoe: Scaling large-scale sparse pre-trained model training via dynamic device placement.Proc. ACM Manag. Data, 2023.https://doi.org/10.1145/3588964

work page doi:10.1145/3588964 2023

[40] [40]

Asynchronous rlhf: Faster and more efficient off-policy rl for language models, 2025

Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hos- seini, Rishabh Agarwal, and Aaron Courville. Asynchronous rlhf: Faster and more efficient off-policy rl for language models, 2025. https://arxiv.org/abs/2410.18252

arXiv 2025

[41] [41]

Nvidia collective communication library (nccl) documen- tation, 2025.https://docs.nvidia.com/deeplearning/nccl/user-guide/ docs/index.html

NVIDIA. Nvidia collective communication library (nccl) documen- tation, 2025.https://docs.nvidia.com/deeplearning/nccl/user-guide/ docs/index.html

2025

[42] [42]

Codeforces, 2025.https://huggingface.co/datasets/open-r1/codeforces

Guilherme Penedo, Anton Lozhkov, Hynek Kydlíček, Loubna Ben Allal, Edward Beeching, Agustín Piqueres Lajarín, Quentin Gallouédec, Nathan Habib, Lewis Tunstall, and Leandro von Werra. Codeforces, 2025.https://huggingface.co/datasets/open-r1/codeforces

2025

[43] [43]

Seer: Online context learning for fast synchronous llm reinforcement learning, 2025.https://arxiv.org/abs/2511.14617

Ruoyu Qin, Weiran He, Weixiao Huang, Yangkun Zhang, Yikai Zhao, Bo Pang, Xinran Xu, Yingdi Shan, Yongwei Wu, and Mingxing Zhang. Seer: Online context learning for fast synchronous llm reinforcement learning, 2025.https://arxiv.org/abs/2511.14617

Pith/arXiv arXiv 2025

[44] [44]

Qwen2.5 technical report, 2025.https://arxiv.org/abs/2412.15115

Qwen, :, An Yang, Baosong Yang, Beichen Zhang, et al. Qwen2.5 technical report, 2025.https://arxiv.org/abs/2412.15115

Pith/arXiv arXiv 2025

[45] [45]

Tyrell Rockafellar.Convex Analysis

R. Tyrell Rockafellar.Convex Analysis. Princeton University Press, 1970.http://www.jstor.org/stable/j.ctt14bs1ff

1970

[46] [46]

Hash layers for large sparse models

Stephen Roller, Sainbayar Sukhbaatar, arthur szlam, and Jason Weston. Hash layers for large sparse models. InAdvances in Neural Information Processing Systems, 2021.https://proceedings.neurips.cc/paper_files/ paper/2021/file/92bf5e6240737e0326ea59846a83e076-Paper.pdf

2021

[47] [47]

Proximal policy optimization algorithms, 2017.https: //arxiv.org/abs/1707.06347

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017.https: //arxiv.org/abs/1707.06347

Pith/arXiv arXiv 2017

[48] [48]

Beat the long tail: Distribution-aware speculative decoding for rl training, 2025.https://arxiv.org/abs/2511.13841

Zelei Shao, Vikranth Srivatsa, Sanjana Srivastava, Qingyang Wu, Al- pay Ariyak, Xiaoxia Wu, Ameen Patel, Jue Wang, Percy Liang, Tri 14 Dao, Ce Zhang, Yiying Zhang, Ben Athiwaratkun, Chenfeng Xu, and Junxiong Wang. Beat the long tail: Distribution-aware speculative decoding for rl training, 2025.https://arxiv.org/abs/2511.13841

arXiv 2025

[49] [49]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024.https://arxiv.org/abs/2402.03300

Pith/arXiv arXiv 2024

[50] [50]

Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, 2017.https: //arxiv.org/abs/1701.06538

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, 2017.https: //arxiv.org/abs/1701.06538

Pith/arXiv arXiv 2017

[51] [51]

Laminar: A scalable asynchronous rl post-training framework, 2025.https://arxiv.org/abs/2510.12633

Guangming Sheng, Yuxuan Tong, Borui Wan, Wang Zhang, Chaobo Jia, Xibin Wu, Yuqi Wu, Xiang Li, Chi Zhang, Yanghua Peng, Haibin Lin, Xin Liu, and Chuan Wu. Laminar: A scalable asynchronous rl post-training framework, 2025.https://arxiv.org/abs/2510.12633

arXiv 2025

[52] [52]

Hybridflow: A flexible and efficient rlhf framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. InProceedings of the Twentieth European Conference on Computer Systems, 2025.http://dx.doi.org/10. 1145/3689031.3696075

arXiv 2025

[53] [53]

Anycostfl: Efficient on-demand federated learning over heterogeneous edge devices

Shaohuai Shi, Xinglin Pan, Xiaowen Chu, and Bo Li. Pipemoe: Ac- celerating mixture-of-experts through adaptive pipelining. InIEEE INFOCOM 2023 - IEEE Conference on Computer Communications, 2023. https://doi.org/10.1109/INFOCOM53939.2023.10228874

work page doi:10.1109/infocom53939.2023.10228874 2023

[54] Shaohuai Shi, Xinglin Pan, Qiang Wang, Chengjian Liu, Xiaozhe Ren, Zhongzhe Hu, Yu Yang, Bo Li, and Xiaowen Chu. Schemoe: An extensible mixture-of-experts distributed training system with tasks scheduling. In Proceedings of the Nineteenth European Conference on Computer Systems, 2024. https://doi.org/10.1145/3627703.3650083

[55] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020. https://arxiv.org/abs/1909.08053

[56] Athinagoras Skiadopoulos, Mark Zhao, Swapnil Gandhi, Thomas Norrie, Shrijeet Mukherjee, and Christos Kozyrakis. SYMI: Efficient Mixture-of-Experts training via model and optimizer state decoupling. In 23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI 26), 2026. https://www.usenix.org/conference/nsdi26/presentation/skiadopoulos

[57] Kimi Team, Tongtong Bai, Yifan Bai, et al. Kimi k2.5: Visual agentic intelligence, 2026. https://arxiv.org/abs/2602.02276

[58] Kimi Team, Yifan Bai, Yiping Bao, et al. Kimi k2: Open agentic intelligence, 2025. https://arxiv.org/abs/2507.20534

[59] Philippe Tillet, H. T. Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, 2019. https://doi.org/10.1145/3315508.3329973

[60] Peng-Yuan Wang, Tian-Shuo Liu, Chenyang Wang, Ziniu Li, Yidi Wang, Shu Yan, Chengxing Jia, Xu-Hui Liu, Xinwei Chen, Jiacheng Xu, and Yang Yu. A survey on large language models for mathematical reasoning. ACM Comput. Surv., 2026. https://dl.acm.org/doi/10.1145/3786333

[61] Xi Wang, Soufiane Hayou, and Eric Nalisnick. The myth of expert specialization in moes: Why routing reflects geometry, not necessarily domain expertise, 2026. https://arxiv.org/abs/2604.09780

[62] Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, and Sida I. Wang. Swe-rl: Advancing llm reasoning via reinforcement learning on open software evolution, 2025. https://arxiv.org/abs/2502.18449

[63] Bo Wu, Sid Wang, Yunhao Tang, Jia Ding, Eryk Helenowski, Liang Tan, Tengyu Xu, Tushar Gowda, Zhengxing Chen, Chen Zhu, Xiaocheng Tang, Yundi Qian, Beibei Zhu, and Rui Hou. Llamarl: A distributed asynchronous reinforcement learning framework for efficient large-scale llm training, 2025. https://arxiv.org/abs/2505.24034

[64] An Yang, Anfeng Li, Baosong Yang, et al. Qwen3 technical report, 2025. https://arxiv.org/abs/2505.09388

[65] Feng Yao, Liyuan Liu, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, and Jianfeng Gao. Your efficient rl framework secretly brings you off-policy rl training, 2025. https://fengyao.notion.site/off-policy-rl

[66] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W... Dapo: An open-source llm reinforcement learning system at scale, 2025. https://arxiv.org/abs/2503.14476

[67] Mingshu Zhai, Jiaao He, Zixuan Ma, Zan Zong, Runqing Zhang, and Jidong Zhai. SmartMoE: Efficiently training Sparsely-Activated models through combining offline and online parallelization. In 2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023. https://www.usenix.org/conference/atc23/presentation/zhai

[68] Junyi Zhang, Chuanhu Ma, Xiong Wang, Yuntao Nie, Yuqing Li, Yuedong Xu, Xiaofei Liao, Bo Li, and Hai Jin. PopFetcher: Towards accelerated Mixture-of-Experts training via popularity based Expert-Wise prefetch. In 2025 USENIX Annual Technical Conference (USENIX ATC 25), 2025. https://www.usenix.org/conference/atc25/presentation/zhang-junyi

[69] Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, and Xin Liu. Comet: Fine-grained computation-communication overlapping for mixture-of-experts. In Proceedings of Machine Learning and Systems, 2025. https://proceedings.mlsys.org/paper_files/paper/2025/file/e27ea0...

[70] Chenqi Zhao, Wenfei Wu, Linhai Song, Yuchen Xu, and Yitao Yuan. Fine-grained moe load balancing with linear programming, 2026. https://arxiv.org/abs/2511.16947

[71] Xin Zhao, Yongkang Liu, Kuan Xu, Jia Guo, Zihao Wang, Yan Sun, Xinyu Kong, Qianggang Cao, Liang Jiang, Zujie Wen, Zhiqiang Zhang, and Jun Zhou. Small leak can sink a great ship–boost rl training on moe with icepop!, 2025. https://ringtech.notion.site/icepop

[72] Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. Pytorch fsdp: Experiences on scaling fully sharded data parallel. Proc. VLDB Endow., 2023. https://doi.org/10.14778/3611540.3611569

[73] Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang, Junrong Lin, Yuqiong Liu, Hao Lin, Chencan Wu, Feng Hu, An Yang, Jingren Zhou, and Junyang Lin. Stabilizing reinforcement learning with llms: Formulation and practices, 2025. https://arxiv.org/abs/2512.01374

[74] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. Sglang: efficient execution of structured language model programs. In Proceedings of the 38th International Conference on Neural Information Processing Systems, 2024. https://dl.acm.o... doi:10.5555/3737916.3739916

[75] Yinmin Zhong, Zili Zhang, Xiaoniu Song, Hanpeng Hu, Chao Jin, Bingyang Wu, Nuo Chen, Yukun Chen, Yu Zhou, Changyi Wan, Hongyu Zhou, Yimin Jiang, Yibo Zhu, and Daxin Jiang. Streamrl: Scalable, heterogeneous, and elastic rl for llms with disaggregated stream generation, 2025. https://arxiv.org/abs/2504.15930

[76] Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Zhifeng Chen, Quoc V Le, and James Laudon. Mixture-of-experts with expert choice routing. In Advances in Neural Information Processing Systems, 2022. https://proceedings.neurips.cc/paper_files/paper/2022/file/2f00ecd787b432c1d36f3de9800728eb-Paper-Conference.pdf