Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns

2026 · cs.LG · arXiv 2604.23150

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Most recent state-of-the-art (SOTA) large language models (LLMs) use Mixture-of-Experts (MoE) architectures to scale model capacity without a proportional increase in per-token compute, enabling higher-quality outputs at manageable serving costs. However, MoE inference at scale is fundamentally bottlenecked by expert load imbalance and inefficient token routing, especially in multi-node deployments, where tokens are not guaranteed to be routed to local experts and inter-node all-to-all communication overhead becomes significant. To systematically characterize these challenges, we profile SOTA open-source MoE models, including Llama 4 Maverick, DeepSeek V3-671B, and Qwen3-230B-A22B, on various datasets and collect over 100k real expert activation traces. From these expert activation patterns, we uncover several properties that persist across all the frontier MoE models: variable expert load imbalance, domain-specific expert activation in which expert popularity shifts across task families (code, math, chat, general), and a strong correlation between prefill and decode expert activations. Motivated by these findings, we propose workload-aware micro-batch grouping and an expert placement strategy that maximize the locality of tokens to their destination experts, thereby reducing inter-node communication. Across models and datasets, these optimizations reduce all-to-all communication data by up to 20, resulting in lower MoE decode latency and better accelerator utilization.
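The two quantities the abstract centers on, expert load imbalance and token locality, can be made concrete with a small sketch. This is an illustration only, not the paper's method: the trace format (one list of routed expert ids per token), the max-over-mean imbalance metric, and the function names `expert_load`, `imbalance`, and `remote_fraction` are all assumptions introduced here.

```python
from collections import Counter


def expert_load(traces, num_experts):
    """Count how many token-to-expert dispatches each expert receives.

    `traces` is assumed to hold one list of routed expert ids per token.
    """
    counts = Counter(e for experts in traces for e in experts)
    return [counts.get(e, 0) for e in range(num_experts)]


def imbalance(load):
    """Max-over-mean load (an assumed metric); 1.0 means perfectly balanced."""
    mean = sum(load) / len(load)
    return max(load) / mean if mean else 0.0


def remote_fraction(traces, token_node, placement):
    """Fraction of dispatches whose expert lives on a different node than the token.

    `token_node[i]` is the node holding token i; `placement[e]` is the node
    hosting expert e. Both layouts are hypothetical.
    """
    total = remote = 0
    for experts, node in zip(traces, token_node):
        for e in experts:
            total += 1
            remote += placement[e] != node
    return remote / total if total else 0.0


# Toy example: 4 tokens with top-2 routing, 4 experts, 2 nodes.
traces = [[0, 1], [0, 1], [2, 3], [0, 2]]
token_node = [0, 0, 1, 1]

load = expert_load(traces, num_experts=4)
interleaved = [0, 1, 0, 1]  # experts alternate between nodes
grouped = [0, 0, 1, 1]      # co-activated experts share a node

print(load, imbalance(load))
print(remote_fraction(traces, token_node, interleaved))
print(remote_fraction(traces, token_node, grouped))
```

In this toy trace, placing co-activated experts on the node that holds the tokens routed to them lowers the share of dispatches that cross nodes, which is the effect a locality-aware placement aims for.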

representative citing papers

ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving

cs.DC · 2026-07-01 · unverdicted · novelty 6.0

Compared with load-balancing baselines, ELDR reduces median time per output token (TPOT) by 5.9-13.9% in PD-disaggregated MoE serving by routing decode requests using prefill-derived expert signatures and K-means locality partitioning.

Beyond Task-Agnostic: Task-Aware Grouping for Communication-Efficient Multi-Task MoE Inference

cs.LG · 2026-05-31 · unverdicted · novelty 6.0

Task-aware expert grouping derived from family-specific co-activation traces cuts average communication cost by 31.39% versus task-agnostic baselines in multi-task MoE inference while keeping the Jain fairness index near 1.0.
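For readers unfamiliar with the fairness measure mentioned in the summary above, Jain's fairness index of n non-negative loads is (sum of loads)^2 / (n * sum of squared loads); it equals 1.0 when all loads are equal and 1/n when a single entry carries everything. A minimal sketch follows. What the citing paper applies the index to is not stated here, so the per-device load vectors are an assumption, as is the function name `jain_index`.

```python
def jain_index(loads):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in the range [1/n, 1]."""
    total = sum(loads)
    squares = sum(x * x for x in loads)
    if squares == 0:
        # All-zero (or empty) input: treated as perfectly fair, a convention assumed here.
        return 1.0
    return total * total / (len(loads) * squares)


# Hypothetical per-device loads.
print(jain_index([5, 5, 5, 5]))   # balanced
print(jain_index([10, 0, 0, 0]))  # one device carries all the load
```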

citing papers explorer

ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving

cs.DC · 2026-07-01 · unverdicted · none · ref 2 · internal anchor

Beyond Task-Agnostic: Task-Aware Grouping for Communication-Efficient Multi-Task MoE Inference

cs.LG · 2026-05-31 · unverdicted · none · ref 7 · internal anchor
