Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
super hub Canonical reference
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Canonical reference. 79% of citing Pith papers cite this work as background.
abstract
Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minim
authors
co-cited works
representative citing papers
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.
CPF-GCD enforces low-rank compositional structure on vision backbone features via spatial primitive fields so that novel categories emerge as new activation patterns over a shared vocabulary of reusable visual primitives.
Grouped Query Experts applies per-group MoE routing to query heads in GQA, matching baseline accuracy while activating half the query heads on 250M models trained for 30B tokens.
LoopMoE is a looped MoE language model that outperforms matched vanilla MoE on 8 of 9 downstream benchmarks at 3B scale and continues to outperform at 9B scale under strictly controlled budgets.
FilterMoE uses joint node-channel routing of Chebyshev filter experts through a 3D gating tensor in pre-propagation GNNs and outperforms baselines on nine of eleven benchmarks while ranking first on all three large-scale ones with a 1.53-point average gain.
ViBE co-optimizes expert placement with measured GPU performance variability in MoE inference to cut execution-time imbalance, delivering 14% better SLO attainment and up to 45% lower P90 TTFT.
A mean-field limit of a reinforcement-based softmax router for two experts shows a supercritical pitchfork bifurcation, with an external asymmetry unfolding it into a cusp of fold bifurcations.
Frontier is a new discrete-event simulator for disaggregated LLM serving that incorporates co-location, PDD, AFD, and optimizations, achieving under 4% throughput error and large reductions in latency prediction error versus prior simulators.
Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.
MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
BatMIL uses hybrid hyperbolic-Euclidean geometry, an S4 state-space backbone, and chunk-level mixture-of-experts to outperform prior multiple-instance learning methods on seven whole-slide image datasets across six cancers.
Approximate multipliers degrade MoE and dense DNNs at different rates; ResNet-20 recovers fully after retraining while VGG models often fail at aggressive approximations except Cluster MoE, and Hard MoE can outperform dense on ViT under cost-matched aggressive approximation.
Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.
MoE-Prefill achieves 1.35-1.59x higher throughput for prefill-only MoE serving by using asynchronous expert parallelism to overlap weight AllGather with computation and prefix-aware routing with true-FLOPs tracking.
A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and commonsense QA benchmarks.
FEPLB reduces token and GEMM stragglers in MoE training by 50-70% using nearly free Copy Engine communication on Hopper architecture.
Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.
DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.
A mixture-of-experts transformer foundation model pretrained on diverse SEM images enables generalization across materials and outperforms SOTA on unsupervised defocus-to-focus restoration.
PathMoE constrains expert paths in MoE models by sharing router parameters across layer blocks, yielding more concentrated paths, better performance on perplexity and tasks, and no need for auxiliary losses.
Split-MoPE integrates split learning with predefined-expert routing to maximize usable data in vertical federated learning under sample misalignment, delivering state-of-the-art accuracy in one communication round plus built-in robustness and per-sample contribution scores.
TWNM framework equips audio-language models with spatial scene analysis via FOA simulation and metadata-grounded training, reaching 70.8% accuracy on a new ASA benchmark.
citing papers explorer
-
Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
-
Identifying Latent Concepts and Structures for Generalized Category Discovery
CPF-GCD enforces low-rank compositional structure on vision backbone features via spatial primitive fields so that novel categories emerge as new activation patterns over a shared vocabulary of reusable visual primitives.
-
Grouped Query Experts: Mixture-of-Experts on GQA Self-Attention
Grouped Query Experts applies per-group MoE routing to query heads in GQA, matching baseline accuracy while activating half the query heads on 250M models trained for 30B tokens.
-
LoopMoE: Unifying Iterative Computation with Mixture-of-Experts for Language Modeling
LoopMoE is a looped MoE language model that outperforms matched vanilla MoE on 8 of 9 downstream benchmarks at 3B scale and continues to outperform at 9B scale under strictly controlled budgets.
-
Gate the Filter, Not the Message: Node-Channel Mixtures for Pre-Propagation GNNs
FilterMoE uses joint node-channel routing of Chebyshev filter experts through a 3D gating tensor in pre-propagation GNNs and outperforms baselines on nine of eleven benchmarks while ranking first on all three large-scale ones with a 1.53-point average gain.
-
ViBE: Co-Optimizing Workload Skew and Hardware Variability for MoE Serving
ViBE co-optimizes expert placement with measured GPU performance variability in MoE inference to cut execution-time imbalance, delivering 14% better SLO attainment and up to 45% lower P90 TTFT.
-
A Minimal Bifurcation Model of Load Imbalance in a Softmax Mixture-of-Experts Router
A mean-field limit of a reinforcement-based softmax router for two experts shows a supercritical pitchfork bifurcation, with an external asymmetry unfolding it into a cusp of fold bifurcations.
-
Frontier: Towards Comprehensive and Accurate LLM Inference Simulation
Frontier is a new discrete-event simulator for disaggregated LLM serving that incorporates co-location, PDD, AFD, and optimizations, achieving under 4% throughput error and large reductions in latency prediction error versus prior simulators.
-
Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts
Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.
-
MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
-
When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models
Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
-
Geometry-Aware State Space Model: A New Paradigm for Whole-Slide Image Representation
BatMIL uses hybrid hyperbolic-Euclidean geometry, an S4 state-space backbone, and chunk-level mixture-of-experts to outperform prior multiple-instance learning methods on seven whole-slide image datasets across six cancers.
-
AxMoE: Characterizing the Impact of Approximate Multipliers on Mixture-of-Experts DNN Architectures
Approximate multipliers degrade MoE and dense DNNs at different rates; ResNet-20 recovers fully after retraining while VGG models often fail at aggressive approximations except Cluster MoE, and Hard MoE can outperform dense on ViT under cost-matched aggressive approximation.
-
Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs
Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.
-
MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
MoE-Prefill achieves 1.35-1.59x higher throughput for prefill-only MoE serving by using asynchronous expert parallelism to overlap weight AllGather with computation and prefix-aware routing with true-FLOPs tracking.
-
Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning
A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and commonsense QA benchmarks.
-
FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training
FEPLB reduces token and GEMM stragglers in MoE training by 50-70% using nearly free Copy Engine communication on Hopper architecture.
-
Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.
-
Depth Adaptive Efficient Visual Autoregressive Modeling
DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.
-
A Mixture of Experts Foundation Model for Scanning Electron Microscopy Image Analysis
A mixture-of-experts transformer foundation model pretrained on diverse SEM images enables generalization across materials and outperforms SOTA on unsupervised defocus-to-focus restoration.
-
Path-Constrained Mixture-of-Experts
PathMoE constrains expert paths in MoE models by sharing router parameters across layer blocks, yielding more concentrated paths, better performance on perplexity and tasks, and no need for auxiliary losses.
-
Mixture of Predefined Experts: Maximizing Data Usage on Vertical Federated Learning
Split-MoPE integrates split learning with predefined-expert routing to maximize usable data in vertical federated learning under sample misalignment, delivering state-of-the-art accuracy in one communication round plus built-in robustness and per-sample contribution scores.
-
The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models
TWNM framework equips audio-language models with spatial scene analysis via FOA simulation and metadata-grounded training, reaching 70.8% accuracy on a new ASA benchmark.
-
FaceMoE: Mixture of Experts for Low-Resolution Face Recognition
FaceMoE introduces a MoE transformer with top-k routed specialized FFN experts for resolution-aware feature extraction in low-resolution face recognition, outperforming prior methods on eleven datasets.
-
Mixture of Debaters: Learn to Debate at Architectural Level in Multi-Agent Reasoning
Mixture of Debaters uses MoE to enable dynamic self-debate inside one model, claiming better accuracy than multi-agent systems at 3.7x lower latency and 87% fewer tokens on multimodal benchmarks.
-
Does the Same Token Mean the Same State? MoE Routing as Signal for Reasoning Control
MoE routing states at boundary and delimiter anchors form basins that align with final answers, enabling RAD, a string-free multi-rollout selector that matches majority voting on math and code tasks.
-
CoRDE: Concept-Prior Routed Diffusion Experts for Structural Generalization in Robot Manipulation
CoRDE uses concept-prior variational distillation and LoRA-based expert pools to route diffusion models for structurally generalizable robot manipulation policies.
-
Behavioral and Representational Evidence of Binomial Ordering Preferences in Large Language Models
LLMs recover dominant binomial orders from corpora but align less closely with exact preference distributions, with preference strength partially encoded in middle-to-late layers and manipulable via steering.
-
MoECodec: Image Compression for joint human and machine perception via Mixture-of-Experts
MoECodec replaces FFN layers with token-wise MoE plus stable routing and GShMLP experts to support multiple downstream tasks in a single image compression model.
-
Achieving Cloud-Grade SLOs for Local Mixture-of-Experts Inference through CPU-GPU Hybrid Design
A CPU-GPU hybrid design with stream-loading prefill, expert parallelism, and disaggregation achieves cloud SLOs for local MoE inference on dual-socket CPUs and consumer GPUs.
-
SARM2: Multi-Task Stage Aware Reward Modeling for Self Improving Robotic Manipulation
SARM2 presents RM, a multi-task stage-aware reward model achieving 80% lower value-estimation MSE, which when used in SPIRAL boosts manipulation task success from ~50% to near-perfect on several benchmarks.
-
STAR: Rethinking MoE Routing as Structure-Aware Subspace Learning
STAR rethinks MoE routing as structure-aware subspace learning by adding a GHA-tracked principal subspace to standard routers, yielding more stable specialization and better performance on synthetic, language, and vision tasks.
-
Chiaroscuro Attention: Spending Compute in the Dark
CHIAR-Former routes tokens via spectral entropy to DCT mixing or attention, yielding 35-40% FLOP savings at 400M parameters with modest perplexity increase on WikiText-103.
-
AlphaQ: Calibration-Free Bit Allocation for Mixture-of-Experts Quantization
AlphaQ performs calibration-free mixed-precision quantization of MoE models by allocating higher bits to experts whose weight spectra exhibit stronger heavy-tailed structure according to HT-SR theory, outperforming calibration-based methods and reaching near full-precision accuracy at 3.5 average bi
-
Expert-Aware Refusal Steering
Refusal steering works on MoE LLMs; expert-aware variants succeed with single-expert outputs and refusal signals differ from routing patterns.
-
DOT-MoE: Differentiable Optimal Transport for MoEfication
DOT-MoE uses differentiable optimal transport and straight-through estimators to partition FFN layers into capacity-constrained experts, outperforming heuristic baselines in retaining 90% performance at 50% active parameters.
-
Learning Multi-Modal Trajectory Policies for Data-Efficient Robotic Manipulation
MATE is a multi-modal MoE trajectory policy using a cosine router and stochastic noise to improve expert balance, reporting 4.75% higher average success rate than prior methods on LIBERO under data scarcity.
-
Hyperbolic and Evidence-Prioritized Experts for Large Vision-Language Models
AsyMoE adds hyperbolic geometry for cross-modal hierarchies and evidence-priority experts to address vision-language asymmetry in LVLMs, reporting 1.5% average gains and 25.45% fewer active parameters.
-
VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies
VISUALTHINK-VLA uses visual evidence tokens and selective routing to reach top success rates on VLA benchmarks while cutting reasoning latency from multi-second to sub-second levels.
-
Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks
Standard LLM inference benchmarks introduce systematic bias via GIL-induced queuing in single-process asyncio setups; a multi-process framework and NTPOT metric isolate true serving engine performance at high query rates.
-
Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models
Complete-muE combines active-width μP and activated-expert scaling to transfer hyperparameters across dense FFN, dense MoE, and sparse MoE while covering changes in experts, capacity, width, depth, batch size, and duration.
-
NASiC: 3D NAND-based CAM-Selected Multibit CIM Architecture for Efficient On-Device Mixture-of-Experts LLM Inference
NASiC fuses CAM-based expert selection and multibit CIM computation in 3D NAND into one cycle for MoE LLM inference, claiming 4-114.8x performance and 3.9-70x energy efficiency gains over prior designs with high accuracy.
-
Exploiting Multicast for Accelerating Collective Communication
MultiWrite is a new many-to-many transmission semantic that uses multicast principles to eliminate redundant packets in collective operations, delivering up to 33% lower latency for AllGather and AlltoAll on Ascend NPUs.
-
PALS: Power-Aware LLM Serving for Mixture-of-Experts Models
PALS adds dynamic GPU power capping to LLM serving frameworks like vLLM, jointly tuning it with batch size via offline models and feedback control to improve energy efficiency up to 26.3% and cut QoS violations 4-7x on dense and MoE models.
-
FedCoE: Bridging Generalization and Personalization via Federated Coordinated Dual-level MoEs
FedCoE proposes a coordinated dual-level MoE framework for federated learning that improves global and personalized accuracy while enabling strong cold-start performance for new clients.
-
HDMoE: A Hierarchical Decoupling-Fusion Mixture-of-Experts Framework for Multimodal Cancer Survival Prediction
HDMoE uses hierarchical MoE and RFR modules to address redundant information and fine-grained intra/inter-modality relationships in multimodal cancer survival prediction, with positive results on private liver cancer and TCGA datasets.
-
GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems
GEM is a GPU-variability-aware expert-to-GPU mapping framework for MoE inference that classifies experts as consistent or temporal and places them to equalize finish times across heterogeneous GPUs.
-
What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code
Controlled experiments show structured reasoning traces and higher-density math-domain samples improve mathematical reasoning more than pure executable code, with internal routing patterns reflecting these data effects.
-
Post-Trained MoE Can Skip Half Experts via Self-Distillation
ZEDA turns post-trained static MoE models into dynamic ones via zero-output expert injection and two-stage self-distillation, cutting over 50% expert FLOPs on Qwen3-30B-A3B and GLM-4.7-Flash with small accuracy drops across 11 benchmarks.
-
Scalable Knowledge Editing for Mixture-of-Experts LLMs via Tensor-Structured Updates
A MEMIT-style knowledge editing framework for MoE LLMs that formulates per-expert updates via tensor structure and applies Woodbury identity for low-rank inversions, achieving up to 6x speedup with comparable editing quality.