Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
hub
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
55 Pith papers cite this work. Polarity classification is still indexing.
abstract
Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.
hub tools
claims ledger
- abstract Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minim
co-cited works
representative citing papers
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.
Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.
MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
BatMIL uses hybrid hyperbolic-Euclidean geometry, an S4 state-space backbone, and chunk-level mixture-of-experts to outperform prior multiple-instance learning methods on seven whole-slide image datasets across six cancers.
Approximate multipliers degrade MoE and dense DNNs at different rates; ResNet-20 recovers fully after retraining while VGG models often fail at aggressive approximations except Cluster MoE, and Hard MoE can outperform dense on ViT under cost-matched aggressive approximation.
Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.
A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and commonsense QA benchmarks.
FEPLB reduces token and GEMM stragglers in MoE training by 50-70% using nearly free Copy Engine communication on Hopper architecture.
Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.
DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.
A mixture-of-experts transformer foundation model pretrained on diverse SEM images enables generalization across materials and outperforms SOTA on unsupervised defocus-to-focus restoration.
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
Localized model averaging with covariate-dependent weights achieves asymptotic optimality and weight consistency for combining pre-trained models under a general loss framework.
DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.
DisagMoE achieves up to 1.8x faster MoE training by disaggregating attention and FFN layers into disjoint GPU groups with a multi-stage uni-directional pipeline and roofline-based bandwidth balancing.
XPERT extracts and reuses cross-domain expert knowledge from pre-trained MoE LLMs via inference analysis and tensor decomposition to improve performance and convergence in downstream language model training.
PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.
DIMoE-Adapters uses self-calibrated expert evolution and prototype-guided selection to dynamically grow and allocate experts, outperforming prior continual learning methods on vision-language models.
Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and vision tasks.
A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.
citing papers explorer
-
Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts
Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.
-
When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models
Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
-
Geometry-Aware State Space Model: A New Paradigm for Whole-Slide Image Representation
BatMIL uses hybrid hyperbolic-Euclidean geometry, an S4 state-space backbone, and chunk-level mixture-of-experts to outperform prior multiple-instance learning methods on seven whole-slide image datasets across six cancers.
-
AxMoE: Characterizing the Impact of Approximate Multipliers on Mixture-of-Experts DNN Architectures
Approximate multipliers degrade MoE and dense DNNs at different rates; ResNet-20 recovers fully after retraining while VGG models often fail at aggressive approximations except Cluster MoE, and Hard MoE can outperform dense on ViT under cost-matched aggressive approximation.
-
Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs
Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.
-
Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning
A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and commonsense QA benchmarks.
-
FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training
FEPLB reduces token and GEMM stragglers in MoE training by 50-70% using nearly free Copy Engine communication on Hopper architecture.
-
Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.
-
Depth Adaptive Efficient Visual Autoregressive Modeling
DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.
-
A Mixture of Experts Foundation Model for Scanning Electron Microscopy Image Analysis
A mixture-of-experts transformer foundation model pretrained on diverse SEM images enables generalization across materials and outperforms SOTA on unsupervised defocus-to-focus restoration.
-
Combining pre-trained models via localized model averaging
Localized model averaging with covariate-dependent weights achieves asymptotic optimality and weight consistency for combining pre-trained models under a general loss framework.
-
Enabling Performant and Flexible Model-Internal Observability for LLM Inference
DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.
-
DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism
DisagMoE achieves up to 1.8x faster MoE training by disaggregating attention and FFN layers into disjoint GPU groups with a multi-stage uni-directional pipeline and roofline-based bandwidth balancing.
-
XPERT: Expert Knowledge Transfer for Effective Training of Language Models
XPERT extracts and reuses cross-domain expert knowledge from pre-trained MoE LLMs via inference analysis and tensor decomposition to improve performance and convergence in downstream language model training.
-
Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression
PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.
-
DIMoE-Adapters: Dynamic Expert Evolution for Continual Learning in Vision-Language Models
DIMoE-Adapters uses self-calibrated expert evolution and prototype-guided selection to dynamically grow and allocate experts, outperforming prior continual learning methods on vision-language models.
-
Hierarchical Mixture-of-Experts with Two-Stage Optimization
Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and vision tasks.
-
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
-
MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.
-
Accelerating MoE with Dynamic In-Switch Computing on Multi-GPUs
DySHARP accelerates MoE expert parallelism via dynamic multimem addressing and token-centric kernel fusion to cut redundant traffic and deliver up to 1.79x speedup over prior in-switch solutions.
-
Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism
Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.
-
Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models
MoR lets clients train local reward models on private preferences and uses a learned Mixture-of-Rewards with GRPO on the server to align a shared base VLM without exchanging parameters, architectures, or raw data.
-
ZeRO-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
ZeRO-Prefill achieves 1.35-1.59x higher throughput for MoE prefill serving by replacing per-layer activation AllToAll with overlapped asynchronous weight AllGather and prefix-aware routing.
-
Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding
EVICT adaptively truncates draft trees in MoE speculative decoding by combining drafter signals with profiled costs to retain only cost-effective prefixes, delivering up to 2.35x speedup over autoregressive decoding.
-
SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs
SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.
-
Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling
X-GRAM applies data-aware dynamic token injection with hybrid hashing and local feature extraction to achieve up to 4.4 accuracy point gains over vanilla backbones and 3.2 over retrieval baselines at 0.73B-1.15B scales using 50% smaller tables.
-
Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs
NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better energy efficiency.
-
Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts
BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
-
WiFo-MiSAC: A Wireless Foundation Model for Multimodal Sensing and Communication Integration via Synesthesia of Machines (SoM)
WiFo-MiSAC is a task-agnostic foundation model that unifies multimodal wireless signals via tokenization and self-supervised learning with SS-DMoE to achieve strong few-shot performance on beam prediction and channel estimation.
-
Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.
-
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.
-
ST-MoE: Designing Stable and Transferable Sparse Expert Models
ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model.
-
FaaSMoE: A Serverless Framework for Multi-Tenant Mixture-of-Experts Serving
FaaSMoE treats MoE experts as on-demand FaaS functions with configurable granularity, using under one-third the resources of a full-model baseline under multi-tenant workloads.
-
Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling
Nexusformer uses a three-stage nonlinear mapping in attention to enable stable, inheritable scaling of transformers, matching baseline perplexity with up to 41.5% less compute when growing from 240M to 440M parameters.
-
M-IDoL: Information Decomposition for Modality-Specific and Diverse Representation Learning in Medical Foundation Model
M-IDoL learns modality-specific and diverse representations by maximizing inter-modality entropy and minimizing intra-modality uncertainty through information decomposition in MoE subspaces.
-
HQF-Net: A Hybrid Quantum-Classical Multi-Scale Fusion Network for Remote Sensing Image Segmentation
HQF-Net reports mIoU gains on three remote-sensing benchmarks by adding quantum circuits to skip connections and a mixture-of-experts bottleneck inside a classical U-Net fused with a DINOv3 backbone.
-
JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency
JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.
-
Kimi K2.5: Visual Agentic Intelligence
Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
-
gpt-oss-120b & gpt-oss-20b Model Card
OpenAI releases two open-weight reasoning models, gpt-oss-120b and gpt-oss-20b, trained via distillation and RL with claimed strong results on math, coding, and safety benchmarks.
-
Kimi K2: Open Agentic Intelligence
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
-
ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism
ResiHP introduces a workload-aware failure detector and dynamic scheduler for hybrid-parallel LLM training that achieves 1.04-4.39x higher throughput than prior resilient systems under failures on a 256-GPU cluster.
-
Artificial Jagged Intelligence as Uneven Optimization Energy Allocation Capability Concentration, Redistribution, and Optimization Governance
AJI frames jagged AI capabilities as lower bounds on performance dispersion arising from concentrated optimization energy allocation under anisotropic objectives, with theorems on tradeoffs and redistribution interventions.
-
Enhancing Online Recruitment with Category-Aware MoE and LLM-based Data Augmentation
LLM chain-of-thought rewriting of job postings plus category-aware MoE improves person-job fit AUC by 2.4%, GAUC by 7.5%, and live click-through conversion by 19.4%.
-
Quadruped Parkour Learning: Sparsely Gated Mixture of Experts with Visual Input
Sparsely gated MoE policies double the success rate of a real Unitree Go2 quadruped on large-obstacle parkour versus matched-active-parameter MLP baselines while cutting inference time compared with a scaled-up MLP.
-
Efficient Handwriting-Based Alzheimer,s Disease Diagnosis Using a Low-Rank Mixture of Experts Deep Learning Framework
A low-rank mixture of experts model trained on handwriting data delivers strong Alzheimer's diagnosis performance with substantially reduced parameter activation during inference.
-
Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
EngGPT2MoE-16B-A3B matches or beats other Italian models on most international benchmarks but trails top international models such as GPT-5 nano and Qwen3-8B.