Scalable and Efficient MoE Training for Multitask Multilingual Models

Alexandre Muzio; Ammar Ahmad Awan; Amr Hendy; Andres Felipe Cruz Salinas; Hany Hassan Awadalla; Liyang Lu; Samyam Rajbhandari; Young Jin Kim; Yuxiong He

arxiv: 2109.10465 · v1 · pith:NMB7QKC2 · submitted 2021-09-22 · 💻 cs.CL · cs.AI · cs.LG

classification 💻 cs.CL · cs.AI · cs.LG

keywords models · system · training · efficiency · efficient · model · multilingual · parameters

abstract

Mixture of Experts (MoE) models are an emerging class of sparsely activated deep learning models whose compute costs grow sublinearly with their parameter count. In contrast with dense models, the sparse MoE architecture makes it possible to grow model size drastically, with significant accuracy gains, while consuming a much lower compute budget. However, supporting large-scale MoE training brings its own set of system and modeling challenges. To overcome the challenges and embrace the opportunities of MoE, we first develop a system capable of scaling MoE models efficiently to trillions of parameters. It combines multi-dimensional parallelism and heterogeneous memory technologies with MoE to support 8x larger models on the same hardware compared with existing work. Besides boosting system efficiency, we also present new training methods to improve MoE sample efficiency and leverage an expert pruning strategy to improve inference-time efficiency. By combining the efficient system and training methods, we are able to significantly scale up large multitask multilingual models for language generation, which results in a large improvement in model accuracy. A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks. System support for efficient MoE training has been implemented and open-sourced in the DeepSpeed library.
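The sublinear-compute claim can be illustrated with a back-of-the-envelope count. The sketch below is not taken from the paper or from DeepSpeed: the function name `moe_ffn_costs`, the layer sizes, and the top-1 routing choice are all assumptions made for illustration. It counts parameters and per-token multiply-adds for a single MoE feed-forward layer with a linear router, ignoring biases, attention, and communication costs.

```python
def moe_ffn_costs(d_model, d_ff, num_experts, top_k=1):
    """Return (parameter count, per-token multiply-adds) for one MoE feed-forward layer.

    Hypothetical helper for illustration only; biases are ignored.
    """
    expert_params = 2 * d_model * d_ff      # two weight matrices per expert
    router_params = d_model * num_experts   # linear gate: one logit per expert
    total_params = num_experts * expert_params + router_params
    # Each token runs the router plus only its top_k selected experts,
    # so per-token work does not scale with the full expert count.
    per_token_macs = router_params + top_k * expert_params
    return total_params, per_token_macs


if __name__ == "__main__":
    # Assumed layer sizes, chosen only to make the trend visible.
    for n in (1, 8, 64):
        params, macs = moe_ffn_costs(d_model=1024, d_ff=4096, num_experts=n)
        print(f"experts={n:3d}  params={params:,}  macs/token={macs:,}")
```

Under these assumptions the parameter count grows roughly in proportion to the number of experts, while the per-token cost grows only by the small router term.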

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EPnG: Adaptive Expert Prune-and-Grow for Parameter-Efficient MoE Fine-tuning
cs.LG 2026-07 unverdicted novelty 6.0

EPnG reallocates LoRA capacity in MoE models by pruning experts with low router gate probabilities and expanding high-importance ones via rank growth, outperforming standard LoRA and nearing full fine-tuning performan...

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
cs.LG 2023-09 accept novelty 6.0

DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.

On the Utility and Factual Reliability of Pruned Mixture-of-Experts Models in the Biomedical Domain
cs.LG 2026-07 unverdicted novelty 4.0

Moderate pruning of MoE models preserves in-domain biomedical utility and reliability but both degrade rapidly in cross-domain settings and at extreme pruning ratios.

The Energy Consumption of Transformer Fine-Tuning: A Roofline-Inspired Scaling Model
cs.LG 2026-06 unverdicted novelty 4.0

A scaling law model derived from roofline analysis and a speedup-based efficiency factor predicts training energy for BERT models across GPU parallelism configurations.