Harnessing Routing Foresight for Micro-step-level MoE load balancing in RL Post-training
Pith reviewed 2026-06-27 08:33 UTC · model grok-4.3
The pith
ForeMoE uses routing foresight from the rollout stage to balance MoE experts at micro-step level in RL post-training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ForeMoE is a micro-step-level load balancing system for MoE RL post-training that exploits foreseeable routing information from the rollout stage to proactively guide expert placement and transfers in the recompute and policy update stages. It uses a hierarchical planner to decompose the NP-hard balancing problem and a transfer engine that leverages complementary CPU-assisted and GPU-direct hardware paths for overlapped expert movement, enabling per-micro-step reconfiguration without dependence on historical statistics.
What carries the argument
Foreseeable routing information from the rollout stage, which proactively guides the hierarchical planner and transfer engine for micro-step expert load balancing.
If this is right
- Load balancing operates at micro-step granularity instead of step-level granularity.
- The system achieves up to 1.45× speedup over state-of-the-art RL post-training systems on 64 GPUs.
- Frequent per-micro-step reconfiguration becomes feasible through decomposed planning and overlapped transfers.
- Balancing no longer depends on historical step-level statistics.
Where Pith is reading between the lines
- The cross-stage foresight pattern could extend to other multi-stage training or serving pipelines where early routing or scheduling decisions inform later resource allocation.
- Similar techniques might reduce reliance on complex historical modeling when workloads exhibit stable coarse-grained but variable fine-grained behavior.
- The approach suggests testing whether routing foresight improves balancing in non-RL MoE settings with comparable pipeline structure.
Load-bearing premise
Routing decisions produced during the rollout stage remain sufficiently accurate and low-overhead to usefully guide expert placement and transfer decisions in the subsequent recompute and policy update stages.
What would settle it
A workload where rollout-stage routing decisions diverge substantially from actual loads in later stages, producing higher imbalance or slower training than historical-statistic baselines on the same RL post-training pipeline.
Figures
read the original abstract
Mixture-of-Experts (MoE) and reinforcement learning (RL) post-training now dominate large language model (LLM) development, yet expert load imbalance remains a critical challenge. Existing load-balancing systems target pre-training by relying on historical step-level statistics. However, these methods fail under the unique workload dynamics of RL post-training: the step-level load is stable, but the tiny batch sizes processed during micro-steps cause severe, high-frequency load fluctuations. We introduce ForeMoE, a micro-step-level load balancing system for MoE RL post-training. Instead of relying on historical statistics, ForeMoE exploits the multi-stage RL pipeline (rollout, recompute, policy update) by using foreseeable routing information from the rollout stage to proactively guide load balancing in the remaining stages. To support frequent per-micro-step reconfiguration, ForeMoE employs a hierarchical planner that decomposes the NP-hard load balancing problem into tractable sub-components, alongside a transfer engine that leverages complementary hardware paths (CPU-assisted and GPU-direct) for overlapped expert transfer. Evaluations on 64 GPUs demonstrate that ForeMoE achieves up to a 1.45$\times$ speedup over state-of-the-art RL post-training systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ForeMoE, a micro-step-level load balancing system for MoE models during RL post-training. It exploits the multi-stage pipeline (rollout, recompute, policy update) by using routing decisions from the rollout stage to proactively configure expert placement and transfers in later stages, employing a hierarchical planner to decompose the NP-hard balancing problem and a transfer engine that overlaps CPU-assisted and GPU-direct paths. On 64 GPUs it reports up to 1.45× speedup over existing RL post-training systems.
Significance. If the routing-foresight assumption holds and the reported speedup is reproducible, the work would provide a concrete systems technique for handling high-frequency load imbalance that is characteristic of RL post-training but not pre-training, potentially improving training throughput for large MoE models without requiring changes to the RL algorithm itself.
major comments (2)
- [Abstract] Abstract: the 1.45× speedup is stated without any description of the baselines, workloads, number of runs, or variance; this information is required to assess whether the result supports the central claim.
- [Abstract] Abstract: no routing-overlap statistics, load-prediction error, or ablation that isolates the foresight component are supplied, leaving the key assumption—that rollout-stage assignments remain sufficiently stable for recompute and update micro-steps—unsupported despite being load-bearing for the performance argument under the described tiny-batch, high-frequency regime.
minor comments (1)
- [Abstract] The abstract introduces the terms 'hierarchical planner' and 'transfer engine' without a one-sentence gloss or pointer to the section that defines them.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below and indicate planned revisions to the abstract and manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the 1.45× speedup is stated without any description of the baselines, workloads, number of runs, or variance; this information is required to assess whether the result supports the central claim.
Authors: We agree that the abstract would benefit from additional context. In the revision we will update the abstract to briefly identify the baselines as state-of-the-art RL post-training systems that rely on historical step-level statistics, note the workloads as tiny-batch MoE RL post-training on 64 GPUs, and state that the 1.45× figure is the maximum observed speedup with full run counts and variance reported in the evaluation section. revision: yes
-
Referee: [Abstract] Abstract: no routing-overlap statistics, load-prediction error, or ablation that isolates the foresight component are supplied, leaving the key assumption—that rollout-stage assignments remain sufficiently stable for recompute and update micro-steps—unsupported despite being load-bearing for the performance argument under the described tiny-batch, high-frequency regime.
Authors: We acknowledge the abstract itself does not contain these supporting figures. The manuscript's evaluation section demonstrates the end-to-end benefit of rollout-stage foresight via the reported speedups. We will revise the abstract to reference the observed stability of routing decisions across pipeline stages and will ensure the body includes explicit routing-overlap statistics, load-prediction error measurements, and an ablation isolating the foresight component. revision: yes
Circularity Check
No circularity: empirical systems claim with no self-referential derivation or fitting
full rationale
The paper presents ForeMoE as a systems artifact that exploits rollout-stage routing to guide later pipeline stages, evaluated empirically on 64 GPUs for a 1.45× speedup. No equations, fitted parameters, or mathematical derivations are described in the abstract or claimed chain; the central result is an end-to-end performance measurement rather than a prediction derived from its own inputs. The stability assumption noted by the skeptic is an empirical precondition, not a self-definition or fitted-input reduction. No self-citations are invoked as load-bearing uniqueness theorems. This is a standard non-circular empirical systems paper.
Axiom & Free-Parameter Ledger
invented entities (2)
-
hierarchical planner
no independent evidence
-
transfer engine
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Deepep: A high-performance communication library, 2025.https: //github.com/deepseek-ai/DeepEP
2025
-
[2]
Expert parallelism load balancer, 2025.https://github.com/deepseek- ai/EPLB
2025
-
[3]
A Survey on Mixture of Experts in Large Language Models , ISSN=
Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. A survey on mixture of experts in large language models. IEEE Transactions on Knowledge and Data Engineering, 2025.http: //dx.doi.org/10.1109/TKDE.2025.3554028
-
[4]
Qiaoling Chen, Zijun Liu, Peng Sun, Shenggui Li, Guoteng Wang, Ziming Liu, Yonggang Wen, Siyuan Feng, and Tianwei Zhang. Respec: Towards optimizing speculative decoding in reinforcement learning systems, 2025.https://arxiv.org/abs/2510.26475
arXiv 2025
-
[5]
DeepSeek-AI, Daya Guo, Dejian Yang, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025.https: //arxiv.org/abs/2501.12948
Pith/arXiv arXiv 2025
-
[6]
Deepseek-v3 technical report, 2025.https://arxiv.org/abs/2412.19437
DeepSeek-AI, Aixin Liu, Bei Feng, et al. Deepseek-v3 technical report, 2025.https://arxiv.org/abs/2412.19437
Pith/arXiv arXiv 2025
-
[7]
GLaM: Efficient scaling of language models with mixture-of-experts
Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lep- ikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P Bosma, Zongwei Zhou, Tao Wang, Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc Le, Yonghui Wu, Zhifeng Chen, and...
2022
-
[8]
Dapo-math-17k dataset, 2025.https://huggingface.co/ datasets/BytedTsinghua-SIA/DAPO-Math-17k
Hugging Face. Dapo-math-17k dataset, 2025.https://huggingface.co/ datasets/BytedTsinghua-SIA/DAPO-Math-17k
2025
-
[9]
Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity
William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 2022.http://jmlr.org/papers/ v23/21-0998.html
2022
-
[10]
Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, and Yi Wu. Areal: A large-scale asynchronous reinforcement learning system for language reasoning, 2025.https://arxiv.org/abs/ 2505.24298
Pith/arXiv arXiv 2025
-
[11]
R. L. Graham. Bounds on multiprocessing timing anomalies.SIAM J. Appl. Math., 1969.https://doi.org/10.1137/0117039
-
[12]
Zhenyu Han, Ansheng You, Haibo Wang, Kui Luo, Guang Yang, Wenqi Shi, Menglong Chen, Sicheng Zhang, Zeshun Lan, Chunshi Deng, Huazhong Ji, Wenjie Liu, Yu Huang, Yixiang Zhang, Chenyi Pan, Jing Wang, Xin Huang, Chunsheng Li, and Jianping Wu. Asyncflow: An asynchronous streaming rl framework for efficient llm post-training, 2025.https://arxiv.org/abs/2507.01663
arXiv 2025
-
[13]
Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, and Qin Li. Fastermoe: modeling and optimizing training of large-scale dynamic pre-trained models. InProceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2022.https://doi.org/10.1145/3503221.3508418
-
[14]
Jingkai He, Tianjian Li, Erhu Feng, Dong Du, Qian Liu, Tao Liu, Yubin Xia, and Haibo Chen. History rhymes: Accelerating llm reinforcement learning with rhymerl, 2025.https://arxiv.org/abs/2508.18588
arXiv 2025
-
[15]
Openrlhf: An easy-to-use, scalable and high-performance rlhf framework, 2025
Jian Hu, Xibin Wu, Wei Shen, Jason Klein Liu, Zilin Zhu, Weixun Wang, Songlin Jiang, Haoran Wang, Hao Chen, Bin Chen, Weikai Fang, Xianyu, Yu Cao, Haotian Xu, and Yiming Liu. Openrlhf: An easy-to-use, scalable and high-performance rlhf framework, 2025. https://arxiv.org/abs/2405.11143
Pith/arXiv arXiv 2025
-
[16]
Zhiyi Hu, Siyuan Shen, Tommaso Bonato, Sylvain Jeaugey, Cedell Alexander, Eric Spada, James Dinan, Jeff Hammond, and Torsten Hoe- fler. Demystifying nccl: An in-depth analysis of gpu communication protocols and algorithms, 2025.https://arxiv.org/abs/2507.04786
arXiv 2025
-
[17]
Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xi- aojuan Qi, Song Han, and Yukang Chen. Qerl: Beyond efficiency – quantization-enhanced reinforcement learning for llms, 2025.https: //arxiv.org/abs/2510.11696
arXiv 2025
-
[18]
Q. Huangfu and J. A. J. Hall. Parallelizing the dual revised simplex method.Mathematical Programming Computation, 2018.https://doi. org/10.1007/s12532-017-0130-5. 13
-
[19]
Tutel: Adap- tive mixture-of-experts at scale
Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, HoYuen Chau, Peng Cheng, Fan Yang, Mao Yang, and Yongqiang Xiong. Tutel: Adap- tive mixture-of-experts at scale. InProceedings of Machine Learning and Systems, 2023.https://proceedings.mlsys.org/paper_files/paper/ 2023/file/5616d34cf8ff739...
2023
-
[20]
Coderl+: Improving code genera- tion via reinforcement with execution semantics alignment, 2026
Xue Jiang, Yihong Dong, Mengyang Liu, Hongyi Deng, Tian Wang, Yongding Tao, Rongyu Cao, Binhua Li, Zhi Jin, Wenpin Jiao, Fei Huang, Yongbin Li, and Ge Li. Coderl+: Improving code genera- tion via reinforcement with execution semantics alignment, 2026. https://arxiv.org/abs/2510.18471
Pith/arXiv arXiv 2026
-
[21]
Chao Jin, Xinming Wei, Yinmin Zhong, Chengxu Yang, Bingyang Wu, Ruidong Zhu, Zili Zhang, Yuliang Liu, and Xin Jin. Relibra: Routing- replay-guided load balancing for moe training in reinforcement learn- ing, 2026.https://arxiv.org/abs/2605.08639
Pith/arXiv arXiv 2026
-
[22]
Towards democratizing LLMs: Investigating mul- tilingual mixture-of-experts models
Aditi Khandelwal, Marius Mosbach, Verna Dankers, Siva Reddy, and Golnoosh Farnadi. Towards democratizing LLMs: Investigating mul- tilingual mixture-of-experts models. InWomen in Machine Learning Workshop @ NeurIPS 2025, 2026.https://openreview.net/forum?id= Bwf4grCk3H
2025
-
[23]
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th Symposium on Operating Systems Principles, 2023.https://doi.org/10.1145/3600006.3613165
-
[24]
{GS}hard: Scaling giant models with conditional com- putation and automatic sharding
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. {GS}hard: Scaling giant models with conditional com- putation and automatic sharding. InInternational Conference on Learning Representations, 2021.https://openreview.net/forum?id= qrwe7XHTmYb
2021
-
[25]
Base layers: Simplifying training of large, sparse models
Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. Base layers: Simplifying training of large, sparse models. InProceedings of the 38th International Conference on Machine Learning, 2021.https://proceedings.mlr.press/v139/lewis21a.html
2021
-
[26]
Haoyang Li, Sheng Lin, Fangcheng Fu, Yuming Zhou, Xiaodong Ji, Yanfeng Zhao, Lefeng Wang, Jie Jiang, and Bin Cui. Unleashing effi- cient asynchronous rl post-training via staleness-constrained rollout coordination, 2026.https://arxiv.org/abs/2601.12784
arXiv 2026
-
[27]
Bingshuai Liu, Ante Wang, Zijun Min, Liang Yao, Haibo Zhang, Yang Liu, Xu Han, Peng Li, Anxiang Zeng, and Jinsong Su. Spec-rl: Accel- erating on-policy reinforcement learning with speculative rollouts, 2026.https://arxiv.org/abs/2509.23232
arXiv 2026
-
[28]
Jiacai Liu, Yingru Li, Yuqian Fu, Jiawei Wang, Qian Liu, and Yu Shen. When speed kills stability: Demystifying rl collapse from the inference- training mismatch, 2025.https://yingru.notion.site/When-Speed- Kills-Stability-Demystifying-RL-Collapse-from-the-Inference- Training-Mismatch-271211a558b7808d8b12d403fd15edda
2025
-
[29]
Flashrl: 8bit rollouts, full power rl, 2025.https: //fengyao.notion.site/flash-rl
Liyuan Liu, Feng Yao, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, and Jianfeng Gao. Flashrl: 8bit rollouts, full power rl, 2025.https: //fengyao.notion.site/flash-rl
2025
-
[30]
Laer-moe: Load-adaptive expert re-layout for efficient mixture-of-experts training
Xinyi Liu, Yujie Wang, Fangcheng Fu, Xuefeng Xiao, Huixia Li, Ji- ashi Li, and Bin Cui. Laer-moe: Load-adaptive expert re-layout for efficient mixture-of-experts training. InProceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2026.https://doi.org/10. 1145/3779212.3790180
arXiv 2026
-
[31]
Han Lu, Zichen Liu, Shaopan Xiong, Yancheng He, Wei Gao, Yanan Wu, Weixun Wang, Jiashun Liu, Yang Li, Haizhou Zhao, Ju Huang, Siran Yang, Xiaoyang Li, Yijia Luo, Zihe Liu, Ling Pan, Junchi Yan, Wei Wang, Wenbo Su, Jiamang Wang, Lin Qu, and Bo Zheng. Part ii: Roll flash – accelerating rlvr and agentic training with asynchrony, 2025.https://arxiv.org/abs/2510.11345
arXiv 2025
-
[32]
Wenhan Ma, Hailin Zhang, Liang Zhao, Yifan Song, Yudong Wang, Zhifang Sui, and Fuli Luo. Stabilizing moe reinforcement learning by aligning training and inference routers, 2025.https://arxiv.org/abs/ 2510.11370
arXiv 2025
-
[33]
Prasanna Mayilvahanan, Ricardo Dominguez-Olmedo, Thaddäus Wiedemer, and Wieland Brendel. Math-beyond: A benchmark for rl to expand beyond the base model, 2025.https://arxiv.org/abs/2510.11653
arXiv 2025
-
[34]
Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. Ray: a distributed framework for emerging ai applications. InProceedings of the 13th USENIX Conference on Operating Systems Design and Implementation, 2018.https://dl.acm. org/doi/10.5555...
-
[35]
Siyuan Mu and Sen Lin. A comprehensive survey of mixture-of- experts: Algorithms, theory, and applications, 2025.https://arxiv.org/ abs/2503.07137
arXiv 2025
-
[36]
Pipedream: generalized pipeline parallelism for dnn training,
Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. Pipedream: generalized pipeline parallelism for dnn train- ing. InProceedings of the 27th ACM Symposium on Operating Systems Principles, 2019.https://doi.org/10.1145/3341301.3359646
-
[37]
Memory-efficient pipeline-parallel dnn training
Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. Memory-efficient pipeline-parallel dnn training. In Proceedings of the 38th International Conference on Machine Learning, 2021.https://proceedings.mlr.press/v139/narayanan21a.html
2021
-
[38]
Efficient large-scale language model training on gpu clusters using megatron-lm,
Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGres- ley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on gpu clusters using megatron-lm. InProceedings of the International Conference for High Perfo...
-
[39]
Flexmoe: Scaling large-scale sparse pre-trained model training via dynamic device placement.Proc
Xiaonan Nie, Xupeng Miao, Zilong Wang, Zichao Yang, Jilong Xue, Lingxiao Ma, Gang Cao, and Bin Cui. Flexmoe: Scaling large-scale sparse pre-trained model training via dynamic device placement.Proc. ACM Manag. Data, 2023.https://doi.org/10.1145/3588964
-
[40]
Asynchronous rlhf: Faster and more efficient off-policy rl for language models, 2025
Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hos- seini, Rishabh Agarwal, and Aaron Courville. Asynchronous rlhf: Faster and more efficient off-policy rl for language models, 2025. https://arxiv.org/abs/2410.18252
arXiv 2025
-
[41]
Nvidia collective communication library (nccl) documen- tation, 2025.https://docs.nvidia.com/deeplearning/nccl/user-guide/ docs/index.html
NVIDIA. Nvidia collective communication library (nccl) documen- tation, 2025.https://docs.nvidia.com/deeplearning/nccl/user-guide/ docs/index.html
2025
-
[42]
Codeforces, 2025.https://huggingface.co/datasets/open-r1/codeforces
Guilherme Penedo, Anton Lozhkov, Hynek Kydlíček, Loubna Ben Allal, Edward Beeching, Agustín Piqueres Lajarín, Quentin Gallouédec, Nathan Habib, Lewis Tunstall, and Leandro von Werra. Codeforces, 2025.https://huggingface.co/datasets/open-r1/codeforces
2025
-
[43]
Ruoyu Qin, Weiran He, Weixiao Huang, Yangkun Zhang, Yikai Zhao, Bo Pang, Xinran Xu, Yingdi Shan, Yongwei Wu, and Mingxing Zhang. Seer: Online context learning for fast synchronous llm reinforcement learning, 2025.https://arxiv.org/abs/2511.14617
Pith/arXiv arXiv 2025
-
[44]
Qwen2.5 technical report, 2025.https://arxiv.org/abs/2412.15115
Qwen, :, An Yang, Baosong Yang, Beichen Zhang, et al. Qwen2.5 technical report, 2025.https://arxiv.org/abs/2412.15115
Pith/arXiv arXiv 2025
-
[45]
Tyrell Rockafellar.Convex Analysis
R. Tyrell Rockafellar.Convex Analysis. Princeton University Press, 1970.http://www.jstor.org/stable/j.ctt14bs1ff
1970
-
[46]
Hash layers for large sparse models
Stephen Roller, Sainbayar Sukhbaatar, arthur szlam, and Jason Weston. Hash layers for large sparse models. InAdvances in Neural Information Processing Systems, 2021.https://proceedings.neurips.cc/paper_files/ paper/2021/file/92bf5e6240737e0326ea59846a83e076-Paper.pdf
2021
-
[47]
Proximal policy optimization algorithms, 2017.https: //arxiv.org/abs/1707.06347
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017.https: //arxiv.org/abs/1707.06347
Pith/arXiv arXiv 2017
-
[48]
Zelei Shao, Vikranth Srivatsa, Sanjana Srivastava, Qingyang Wu, Al- pay Ariyak, Xiaoxia Wu, Ameen Patel, Jue Wang, Percy Liang, Tri 14 Dao, Ce Zhang, Yiying Zhang, Ben Athiwaratkun, Chenfeng Xu, and Junxiong Wang. Beat the long tail: Distribution-aware speculative decoding for rl training, 2025.https://arxiv.org/abs/2511.13841
arXiv 2025
-
[49]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024.https://arxiv.org/abs/2402.03300
Pith/arXiv arXiv 2024
-
[50]
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, 2017.https: //arxiv.org/abs/1701.06538
Pith/arXiv arXiv 2017
-
[51]
Laminar: A scalable asynchronous rl post-training framework, 2025.https://arxiv.org/abs/2510.12633
Guangming Sheng, Yuxuan Tong, Borui Wan, Wang Zhang, Chaobo Jia, Xibin Wu, Yuqi Wu, Xiang Li, Chi Zhang, Yanghua Peng, Haibin Lin, Xin Liu, and Chuan Wu. Laminar: A scalable asynchronous rl post-training framework, 2025.https://arxiv.org/abs/2510.12633
arXiv 2025
-
[52]
Hybridflow: A flexible and efficient rlhf framework
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. InProceedings of the Twentieth European Conference on Computer Systems, 2025.http://dx.doi.org/10. 1145/3689031.3696075
arXiv 2025
-
[53]
Anycostfl: Efficient on-demand federated learning over heterogeneous edge devices
Shaohuai Shi, Xinglin Pan, Xiaowen Chu, and Bo Li. Pipemoe: Ac- celerating mixture-of-experts through adaptive pipelining. InIEEE INFOCOM 2023 - IEEE Conference on Computer Communications, 2023. https://doi.org/10.1109/INFOCOM53939.2023.10228874
-
[54]
Donaldson, John Wickerson, and Manuel Rigger
Shaohuai Shi, Xinglin Pan, Qiang Wang, Chengjian Liu, Xiaozhe Ren, Zhongzhe Hu, Yu Yang, Bo Li, and Xiaowen Chu. Schemoe: An ex- tensible mixture-of-experts distributed training system with tasks scheduling. InProceedings of the Nineteenth European Conference on Computer Systems, 2024.https://doi.org/10.1145/3627703.3650083
-
[55]
Megatron-lm: Training multi- billion parameter language models using model parallelism, 2020
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi- billion parameter language models using model parallelism, 2020. https://arxiv.org/abs/1909.08053
Pith/arXiv arXiv 2020
-
[56]
SYMI: Efficient Mixture-of-Experts training via model and optimizer state decoupling
Athinagoras Skiadopoulos, Mark Zhao, Swapnil Gandhi, Thomas Norrie, Shrijeet Mukherjee, and Christos Kozyrakis. SYMI: Efficient Mixture-of-Experts training via model and optimizer state decoupling. In23rd USENIX Symposium on Networked Systems Design and Imple- mentation (NSDI 26), 2026.https://www.usenix.org/conference/nsdi26/ presentation/skiadopoulos
2026
-
[57]
Kimi k2.5: Visual agentic intelligence, 2026.https://arxiv.org/abs/2602.02276
Kimi Team, Tongtong Bai, Yifan Bai, et al. Kimi k2.5: Visual agentic intelligence, 2026.https://arxiv.org/abs/2602.02276
Pith/arXiv arXiv 2026
-
[58]
Kimi k2: Open agentic intelli- gence, 2025.https://arxiv.org/abs/2507.20534
Kimi Team, Yifan Bai, Yiping Bao, et al. Kimi k2: Open agentic intelli- gence, 2025.https://arxiv.org/abs/2507.20534
Pith/arXiv arXiv 2025
-
[59]
Philippe Tillet, H. T. Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. InPro- ceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, 2019.https://doi.org/10.1145/ 3315508.3329973
arXiv 2019
-
[60]
A survey on large language models for mathematical reasoning.ACM Comput
Peng-Yuan Wang, Tian-Shuo Liu, Chenyang Wang, Ziniu Li, Yidi Wang, Shu Yan, Chengxing Jia, Xu-Hui Liu, Xinwei Chen, Jiacheng Xu, and Yang Yu. A survey on large language models for mathematical reasoning.ACM Comput. Surv., 2026.https://dl.acm.org/doi/10.1145/ 3786333
2026
-
[61]
Xi Wang, Soufiane Hayou, and Eric Nalisnick. The myth of expert specialization in moes: Why routing reflects geometry, not necessarily domain expertise, 2026.https://arxiv.org/abs/2604.09780
Pith/arXiv arXiv 2026
-
[62]
Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, and Sida I. Wang. Swe-rl: Advancing llm reasoning via reinforcement learning on open software evolution, 2025.https://arxiv.org/abs/2502. 18449
2025
-
[63]
Bo Wu, Sid Wang, Yunhao Tang, Jia Ding, Eryk Helenowski, Liang Tan, Tengyu Xu, Tushar Gowda, Zhengxing Chen, Chen Zhu, Xiaocheng Tang, Yundi Qian, Beibei Zhu, and Rui Hou. Llamarl: A distributed asynchronous reinforcement learning framework for efficient large- scale llm training, 2025.https://arxiv.org/abs/2505.24034
arXiv 2025
-
[64]
Qwen3 technical report, 2025.https://arxiv.org/abs/2505.09388
An Yang, Anfeng Li, Baosong Yang, et al. Qwen3 technical report, 2025.https://arxiv.org/abs/2505.09388
Pith/arXiv arXiv 2025
-
[65]
Your efficient rl framework secretly brings you off-policy rl training, 2025.https://fengyao.notion.site/off-policy-rl
Feng Yao, Liyuan Liu, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, and Jianfeng Gao. Your efficient rl framework secretly brings you off-policy rl training, 2025.https://fengyao.notion.site/off-policy-rl
2025
-
[66]
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...
Pith/arXiv arXiv 2025
-
[67]
SmartMoE: Efficiently training Sparsely-Activated models through combining offline and online parallelization
Mingshu Zhai, Jiaao He, Zixuan Ma, Zan Zong, Runqing Zhang, and Jidong Zhai. SmartMoE: Efficiently training Sparsely-Activated models through combining offline and online parallelization. In2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023.https://www. usenix.org/conference/atc23/presentation/zhai
2023
-
[68]
PopFetcher: Towards ac- celerated Mixture-of-Experts training via popularity based Expert- Wise prefetch
Junyi Zhang, Chuanhu Ma, Xiong Wang, Yuntao Nie, Yuqing Li, Yue- dong Xu, Xiaofei Liao, Bo Li, and Hai Jin. PopFetcher: Towards ac- celerated Mixture-of-Experts training via popularity based Expert- Wise prefetch. In2025 USENIX Annual Technical Conference (USENIX ATC 25), 2025.https://www.usenix.org/conference/atc25/presentation/ zhang-junyi
2025
-
[69]
Comet: Fine- grained computation-communication overlapping for mixture-of- experts
Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wen- lei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, and Xin Liu. Comet: Fine- grained computation-communication overlapping for mixture-of- experts. InProceedings of Machine Learning and Systems, 2025.https://proceedings.mlsys.org/paper_files/paper/2025/file/ e27ea0...
2025
-
[70]
Fine-grained moe load balancing with linear programming, 2026.https: //arxiv.org/abs/2511.16947
Chenqi Zhao, Wenfei Wu, Linhai Song, Yuchen Xu, and Yitao Yuan. Fine-grained moe load balancing with linear programming, 2026.https: //arxiv.org/abs/2511.16947
arXiv 2026
-
[71]
Small leak can sink a great ship–boost rl training on moe with icepop!, 2025.https://ringtech.notion.site/icepop
Xin Zhao, Yongkang Liu, Kuan Xu, Jia Guo, Zihao Wang, Yan Sun, Xinyu Kong, Qianggang Cao, Liang Jiang, Zujie Wen, Zhiqiang Zhang, and Jun Zhou. Small leak can sink a great ship–boost rl training on moe with icepop!, 2025.https://ringtech.notion.site/icepop
2025
-
[72]
doi: 10.14778/3611540.3611569.https: //doi.org/10.14778/3611540.3611569
Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. Pytorch fsdp: Experiences on scaling fully sharded data parallel.Proc. VLDB Endow., 2023.https://doi.org/10.147...
-
[73]
Stabilizing reinforcement learning with llms: Formulation and practices, 2025.https://arxiv.org/abs/2512
Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang, Jun- rong Lin, Yuqiong Liu, Hao Lin, Chencan Wu, Feng Hu, An Yang, Jingren Zhou, and Junyang Lin. Stabilizing reinforcement learning with llms: Formulation and practices, 2025.https://arxiv.org/abs/2512. 01374
2025
-
[74]
Gonzalez, Clark Barrett, and Ying Sheng
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. Sglang: efficient execution of structured language model programs. InProceedings of the 38th International Conference on Neural Information Processing Systems, 2024.https://dl.acm.o...
-
[75]
Yinmin Zhong, Zili Zhang, Xiaoniu Song, Hanpeng Hu, Chao Jin, Bingyang Wu, Nuo Chen, Yukun Chen, Yu Zhou, Changyi Wan, Hongyu Zhou, Yimin Jiang, Yibo Zhu, and Daxin Jiang. Streamrl: Scalable, heterogeneous, and elastic rl for llms with disaggregated stream generation, 2025.https://arxiv.org/abs/2504.15930. 15
arXiv 2025
-
[76]
the gradient of 𝑒
Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, zhifeng Chen, Quoc V Le, and James Laudon. Mixture-of-experts with expert choice rout- ing. InAdvances in Neural Information Processing Systems, 2022.https://proceedings.neurips.cc/paper_files/paper/2022/file/ 2f00ecd787b432c1d36f3de9800728eb-Paper-Conference.pdf. 16 A De...
2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.