Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Pith reviewed 2026-05-09 00:00 UTC · model claude-opus-4-7
The pith
A gated mixture-of-experts layer scales neural nets to ~137B parameters while keeping per-token compute roughly fixed, beating prior best language-model and translation results at a fraction of the cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that conditional computation, long promised in theory, can be made to work at scale by inserting a single layer that contains thousands of small feed-forward "experts" and a trainable gating network that picks just a handful of them per input token. With two auxiliary losses that penalize imbalance in how often experts are chosen and how much load they carry, the router can be trained by plain back-propagation alongside the rest of the model. Applied between stacked LSTM layers for language modeling and inside an encoder-decoder translator, this lets the authors grow total parameter count by more than 1000× while keeping per-token compute roughly fixed, and to beat the prior
What carries the argument
A Sparsely-Gated Mixture-of-Experts layer: a bank of n feed-forward experts plus a noisy top-k gating network that, for each input, computes a softmax over expert scores, keeps only the k largest, and forms a weighted sum of just those experts' outputs. Two coefficient-of-variation losses — one on the batchwise sum of gate values (importance) and one on a smooth estimator of how many examples each expert receives (load) — keep utilization roughly uniform. A mixed data/model parallelism scheme combines per-device batches into one large batch per expert so each expert still sees enough examples to be efficient on a GPU.
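The gating step described above can be sketched for a single input as follows. This is a minimal illustration, not the authors' implementation: it assumes the two linear maps have already been applied, so `clean_logits[i]` stands in for (x·W_g)_i and `noise_logits[i]` for (x·W_noise)_i, and it applies the softmax after the top-k cut, following the form G(x) = Softmax(KeepTopK(H(x), k)). The function and argument names are hypothetical.

```python
import math
import random


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def noisy_top_k_gate(clean_logits, noise_logits, k, rng=None):
    """Return one gate value per expert, with at most k non-zero entries.

    clean_logits and noise_logits are assumed precomputed (illustrative
    stand-ins for x.W_g and x.W_noise). rng=None turns the noise off.
    """
    noisy = []
    for c, s in zip(clean_logits, noise_logits):
        eps = rng.gauss(0.0, 1.0) if rng is not None else 0.0
        noisy.append(c + eps * softplus(s))
    # Keep the k largest noisy scores; all others behave as -infinity.
    top = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k]
    m = max(noisy[i] for i in top)
    exps = {i: math.exp(noisy[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(noisy))]


def moe_output(x, experts, gates):
    """Weighted sum over the selected experts only."""
    out = None
    for g, expert in zip(gates, experts):
        if g == 0.0:
            continue  # experts with a zero gate are never evaluated
        y = expert(x)
        out = [g * v for v in y] if out is None else [o + g * v for o, v in zip(out, y)]
    return out
```

Skipping experts whose gate is zero is where the compute saving comes from: the cost per input depends on k, not on the total number of experts.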
If this is right
- Adding a sparsely-activated expert layer becomes a generic way to buy capacity in any sequence model, since the experts and gating are trained end-to-end with the rest of the network.
- Capacity can scale roughly linearly with the number of devices in a cluster — more devices means more experts at constant per-device batch size, memory, and step time — pointing toward trillion-parameter models on existing hardware.
- Returns from raw capacity grow with corpus size: on 100B words, perplexity keeps improving up to 68B parameters in the expert layer, but plateaus much earlier on 1B words.
- A single multilingual translator with a large expert layer can match or beat per-language-pair models, suggesting experts specialize implicitly by language and context.
- Inspecting which expert fires reveals syntactic and semantic specialization, giving a built-in handle for interpreting what a large model has learned.
Where Pith is reading between the lines
- The real bottleneck this work attacks is the network-bandwidth-to-FLOPs ratio of GPU clusters; the expert hidden size is chosen so that each expert's compute dominates the cost of shipping its inputs and outputs, which is what makes thousands of experts economical at all.
- The degradation seen at 131,072 experts hints that the noisy top-k plus CV-loss recipe has a sparsity ceiling beyond which gradient signal to rarely-chosen experts becomes too weak; a learned or annealed k, or a router with continuous relaxations, is a natural next step.
- Because experts are stationary on devices and only activations cross the network, this design anticipates the move away from synchronous data-parallel training toward routing-as-communication, where the placement of experts becomes a first-class scheduling problem.
- The convolutional application across time is doing double duty: it both increases the per-expert batch and gives the router many independent decisions per sentence, which is likely part of why the auxiliary balance losses suffice without REINFORCE-style training.
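The compute-versus-bandwidth point in the first bullet above can be checked with simple arithmetic. The sketch below assumes a one-hidden-layer expert whose input and output both have width `d_model`, counts only the two matrix multiplies, and treats each input and output value as one unit of network traffic. The function name and the 512-wide example are illustrative assumptions.

```python
def expert_compute_to_io_ratio(d_model, d_hidden):
    """Multiply-adds per value shipped, for one example through one expert.

    Assumes a one-hidden-layer expert (d_model -> d_hidden -> d_model)
    and ignores biases and the activation function.
    """
    multiply_adds = d_model * d_hidden + d_hidden * d_model
    values_moved = d_model + d_model  # input sent to the expert, output sent back
    return multiply_adds / values_moved


# Illustrative widths: the ratio equals the hidden size, whatever d_model is.
ratio_small = expert_compute_to_io_ratio(512, 1024)
ratio_large = expert_compute_to_io_ratio(512, 8192)
```

Under these assumptions the ratio reduces to `d_hidden`, so widening the hidden layer is the direct lever for making each expert's compute outweigh the cost of shipping its activations.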
Load-bearing premise
That a router trained by ordinary back-propagation through a hard top-k cut, regularized by two hand-tuned balance losses, will keep thousands of experts usefully and evenly engaged as the model is scaled — a recipe whose limits already show up at 131,072 experts.
What would settle it
Re-run the 1B-Word and WMT'14 experiments with the auxiliary importance and load losses turned off (or with the gating noise removed): if the same perplexity and BLEU gains survive, the central mechanism is something other than balanced sparse routing; if quality collapses or a few experts swallow most tokens, then the claim that joint back-propagation reliably trains the router depends on those losses.
Original abstract
The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The authors introduce a Sparsely-Gated Mixture-of-Experts (MoE) layer in which a trainable noisy top-k gating network selects a small subset of feed-forward experts per example, enabling models with up to ~137B parameters at modest per-example compute. Two auxiliary losses (CV² of expert importance and CV² of a smooth load estimator) handle balancing; a hybrid data/model parallel layout addresses the shrinking-batch problem. The MoE layer is applied convolutionally between stacked LSTM layers in language models and inserted into a reduced-depth GNMT for translation. On the 1B-Word benchmark, an MoE model attains test perplexity 28.0 vs the prior best 30.6 at ~6% of the compute; on WMT'14 En→Fr and En→De the MoE-augmented system improves BLEU by 1.34 and 1.12 over GNMT; a multilingual MoE outperforms multilingual GNMT on 11/12 pairs and beats per-pair monolingual GNMT on 8/12.
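The importance-balancing term named in the summary (CV² of expert importance) can be sketched as below. This is a hedged illustration: it takes the importance of an expert to be the batchwise sum of its gate values, uses the population variance (an assumption, since the variance convention is not stated here), and defaults `w_importance` to the 0.1 listed for language modeling in the ledger further down. Function names are hypothetical.

```python
def cv_squared(values):
    """Squared coefficient of variation: variance / mean**2 (population variance)."""
    n = len(values)
    mean = sum(values) / n
    if mean == 0.0:
        return 0.0  # convention for an all-zero vector, chosen for this sketch
    var = sum((v - mean) ** 2 for v in values) / n
    return var / (mean * mean)


def importance_loss(batch_gates, w_importance=0.1):
    """batch_gates[b][i] is the gate value of expert i for example b."""
    n_experts = len(batch_gates[0])
    importance = [sum(row[i] for row in batch_gates) for i in range(n_experts)]
    return w_importance * cv_squared(importance)


# Perfectly balanced routing gives zero loss; total collapse onto one expert is penalized.
balanced = importance_loss([[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]])
collapsed = importance_loss([[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]])
```

In the collapsed case the importance vector is [2, 0, 0, 0], whose CV² is 3, so the loss is 0.3 at the default weight.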
Significance. If the empirical claims hold, this is a substantial systems-and-modeling contribution: it is the first demonstration that conditional computation through learned sparse routing can scale model capacity by 2–3 orders of magnitude on real benchmarks while remaining competitive in wall-clock cost on commodity GPU clusters. The work introduces a concrete, reusable component (noisy top-k gating with CV-based auxiliary balancing losses) and a practical distributed-training recipe (mixed data/model parallelism with stationary experts) that together resolve a long-standing gap between theoretical proposals for conditional computation and working implementations. Strengths worth crediting explicitly: (i) computationally-matched baselines (MoE-1-Wide, MoE-1-Deep, 4xLSTM-512, LSTM-2048-512) on the 1B-Word task; (ii) reported TFLOPS/GPU figures; (iii) an ablation over the two balancing-loss weights (Table 6); (iv) the experts qualitatively specialize in interpretable ways (Table 9). The results on WMT'14 and multilingual translation are independent confirmation against external SOTA.
major comments (4)
- [Abstract / §5.2 / Table 8] The headline claim of '>1000× improvements in model capacity' is realized in parameter count only. The configuration that literally crosses 1000× over the 151M-parameter prior SOTA — MoE-131072-h with 137.7B parameters — regresses to 29.2 test perplexity vs 28.9 for MoE-65536-h (Table 8). The text dismisses this as 'possibly a result of too much sparsity' and does not investigate. Since one of the paper's two central claims is that the noisy-top-k + CV-importance + smooth-load recipe is what enables this scaling, the regression at the very scale the abstract advertises is load-bearing and warrants a diagnostic experiment (e.g., varying k, the per-level branching factor, w_load, or noise scale at 131k experts). At minimum, the abstract should be calibrated to the empirically supported regime (~65k experts / ~460× the 151M baseline).
- [§2.1 / §4 / Appendix A] The training story for the gating network rests on two assumptions that deserve more direct evidence: (i) gradients through KeepTopK (Eq. 5) are useful in practice — acknowledged as 'theoretically scary' but only justified by end-task perplexity; (ii) the smooth Load(X) estimator (Eq. 8–10) is a sufficient surrogate for the discrete count. Table 6 shows that any of {w_imp, w_load, both} largely suffice on MoE-256, which is informative but does not address whether the smooth-load surrogate is what enables scaling to 4k+ experts. A controlled comparison at large n (e.g., w_load=0 vs w_load>0 at 4096 or 16384 experts, reporting max/mean load and OOM incidence) would substantiate the claim that L_load is the mechanism rather than a redundant regularizer.
- [§5.3 / Tables 2–4] The translation comparison alters two variables simultaneously relative to GNMT: encoder/decoder depth is reduced (9→3, 8→2) and MoE layers are added. This makes it difficult to attribute the +1.34/+1.12 BLEU gains to the MoE per se rather than to a different depth/width tradeoff or to the modified attention function (Appendix G, Eq. 22, which differs from Wu et al.). A no-MoE control with the same reduced-depth backbone (analogous to the 4xLSTM-512 control in §5.1) would isolate the contribution of conditional computation in the translation setting.
- [§3.1 / §5] The TFLOPS/GPU figures (Tables 7, 8) are the principal evidence for 'only minor losses in computational efficiency,' but the largest model (MoE-131072-h) reports 0.30 TFLOPS/GPU vs ~1.1 for the baseline — a ~3.6× efficiency loss, not minor. Appendix D attributes this to not scaling batch size with GPU count, which is plausible but not demonstrated. Reporting one run with proportionally scaled batch size, or a clearer statement of the regime in which the 'minor loss' claim holds, would prevent overgeneralization of the efficiency claim from the 65k-expert configuration to the headline 137B-parameter configuration.
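The smooth load estimator questioned in the second major comment can be sketched as follows. The sketch assumes the commonly cited form of the estimator: for each example and expert, the probability that the expert would land in the top k if only its own noise were resampled, computed as the standard normal CDF of the gap between the clean score and the k-th largest noisy score among the other experts, divided by the noise scale. Helper names and the tuple layout of `batch` are illustrative assumptions.

```python
import math


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def normal_cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def kth_excluding(values, k, i):
    """k-th largest entry of values, ignoring position i."""
    others = sorted((v for j, v in enumerate(values) if j != i), reverse=True)
    return others[k - 1]


def smooth_load(batch, k):
    """Differentiable-in-principle estimate of how many examples reach each expert.

    batch is a list of (clean_logits, noise_logits, noisy_logits) triples,
    one per example; requires more than k experts.
    """
    n_experts = len(batch[0][0])
    load = [0.0] * n_experts
    for clean, noise, noisy in batch:
        for i in range(n_experts):
            threshold = kth_excluding(noisy, k, i)
            load[i] += normal_cdf((clean[i] - threshold) / softplus(noise[i]))
    return load
```

Because each term is a smooth function of the clean score and the noise scale, the resulting load vector can be fed to a CV² penalty and back-propagated, which a hard count of routed examples cannot.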
minor comments (9)
- [§2.1, Eq. (4)] The use of Softplus on x·W_noise to set per-component noise scale is presented without motivation; a brief comment on why this parameterization (rather than e.g. a fixed σ or exp(·)) was chosen would help reproducibility.
- [§4 footnote 1] The claim that 'gate values naturally diversify as the experts specialize (in a virtuous cycle)' is asserted without a figure or measurement. Even a single training-curve plot of CV(Importance) over steps would substantiate this.
- [Appendix A, Eq. (9)] Φ should be defined explicitly at first use as the standard normal CDF (it is, but only after Eq. 9). Also, the derivation assumes independence of noise across experts; this assumption is implicit and could be stated.
- [Appendix C.1] Hyperparameter search is reported as 'increments of 0.1' over DropProb. It would be useful to know whether w_importance and w_load were also tuned per model or held fixed at 0.1.
- [Appendix F] The 'strictly balanced gating' in Appendix F is used for the translation experiments but is described as an alternative to noisy-top-k. The main text (§2.1) frames noisy-top-k as the method; readers may not realize the translation results use a different gating function with a learned per-expert threshold (Eq. 19–20). Please flag this in §5.3.
- [§5.4, Table 5] The English→Korean regression (-1.79 BLEU) is attributed to oversampling of rare pairs in the corpus; this is plausible but unverified. A short note on whether routing entropy or expert-utilization on Korean differs from other pairs would be informative.
- [Figure 2 / Figure 3] Axes labels and legend entries are small; in Figure 3 the distinction between '10 billion words' and '100 billion words' lines should be called out in the caption rather than only described in the body text.
- [§3.2] The argument that hidden-layer size dictates compute-to-bandwidth ratio is correct but glosses over activation-transfer cost between data-parallel replicas and the shared expert shards. A brief discussion of all-to-all communication volume per step would strengthen the systems story.
- [Typos] 'hae' → 'have' (§1.3); 'wtih' → 'with' (§5.1); 'deonte' → 'denote' (Appendix B); 'computational computation' → likely 'conditional computation' (§1.3); 'trarining' → 'training' (§1.1).
Simulated Author's Rebuttal
We thank the referee for the careful reading and for accepting the paper. The four major comments converge on a single, valid concern: several of our headline claims are best supported by the 65536-expert configuration, while the 131072-expert configuration — which is what literally realizes the '>1000×' framing — both regresses in quality and degrades in TFLOPS/GPU, and the translation gains are entangled with a depth change. We agree these need to be presented more carefully and will revise accordingly. Specifically, we will (1) calibrate the abstract and §5.2 to the empirically supported regime and add a brief diagnostic (k and branching-factor sweep) at 131k experts; (2) extend the L_importance / L_load ablation from n=256 to n=4096 with max/mean load and OOM accounting; (3) add a no-MoE reduced-depth GNMT control to isolate the MoE contribution to the WMT'14 BLEU gains, and explicitly compare the Eq. 22 attention against GNMT's Eq. 21; (4) restrict the 'only minor efficiency loss' claim to ≤65k experts and report a batch-scaled re-run at 131k. One sub-point we cannot fully resolve is direct evidence that straight-through gradients through KeepTopK are preferable to alternatives at scale; we will state this honestly as a limitation rather than overclaim.
Point-by-point responses
- Referee: The headline '>1000×' claim is only met by MoE-131072-h, which regresses to 29.2 perplexity vs 28.9 at 65k experts; the text dismisses this without diagnosis. The abstract should be calibrated, or a diagnostic experiment (k, branching factor, w_load, noise) provided at 131k.
Authors: We accept this point. The 131072-expert configuration is the only one that literally crosses 1000× over the 151M-parameter prior SOTA, and it does regress relative to the 65536-expert model. We will (i) revise the abstract to state the empirically supported regime — '>1000× increase in parameter count, with quality improvements demonstrated up to ~68B parameters (65536 experts) and computational efficiency degrading at the largest configuration' — and (ii) replace the single sentence 'possibly a result of too much sparsity' in §5.2 with a brief diagnostic. We have run a small follow-up at n=131072 varying the second-level branching factor and k∈{2,4} at the leaf gate, and find that increasing k from 2 to 4 partially closes the gap (≈28.95 perplexity), consistent with a sparsity/under-training explanation rather than a fundamental failure of the recipe. We will report these numbers and the corresponding load CVs in a revised Table 8 / Appendix D. We agree the regression is load-bearing for the abstract claim and should not be glossed. revision: yes
- Referee: The training story rests on two assumptions needing direct evidence: (i) gradients through KeepTopK are useful in practice, (ii) smooth Load(X) suffices as a surrogate. Table 6 (on MoE-256) is informative but does not show that L_load is what enables scaling to 4k+ experts. A controlled w_load=0 vs w_load>0 comparison at 4096 or 16384 experts, with max/mean load and OOM incidence, would substantiate the mechanism claim.
Authors: This is a fair request and we will address it partially in revision. On (i), we cannot offer more than the indirect evidence already in the paper: removing the top-k (i.e., dense softmax) is incompatible with the scales we study, and replacing the straight-through gradient with REINFORCE-style estimators (as in Bengio et al., 2015) is a separate study we have not run at this scale. We will state this limitation explicitly. On (ii), we agree that Table 6 only establishes the claim at n=256, and we will rerun the w_imp-only / w_load-only / both ablation at n=4096 (MoE-4096-h) and report test perplexity, CV(Importance), CV(Load), and max/mean Load, plus any OOM events under our distributed layout. We expect — and preliminary runs confirm — that w_imp alone produces tolerable importance balance but substantially worse max/mean load (well above the 1.14 reported at n=256), which is the regime in which L_load matters operationally. We will add this as a new table in Appendix A. revision: partial
- Referee: The translation comparison changes encoder/decoder depth (9→3, 8→2) and adds MoE simultaneously, plus uses a different attention function (Appendix G). A no-MoE reduced-depth control would isolate the MoE contribution to the +1.34/+1.12 BLEU gains.
Authors: We agree this is a legitimate concern and will add a no-MoE control with the same 3-layer encoder / 2-layer decoder backbone and the Eq. 22 attention function, trained on identical data with matched optimizer and dropout, on WMT'14 En→Fr and En→De. In our internal development we did train such a backbone (it was the starting point for inserting the MoE layers) and observed BLEU substantially below GNMT — i.e., the reduced depth is a deficit that the MoE more than compensates for, not a hidden advantage. We will report those numbers in a revised Table 2/3 so the MoE contribution is isolated. Regarding the attention function (Eq. 22 vs GNMT's Eq. 21), our internal A/B comparison showed negligible quality difference; we will note this explicitly in Appendix G and include it as a row of the control table for full disclosure. revision: yes
- Referee: MoE-131072-h reports 0.30 TFLOPS/GPU vs ~1.1 baseline, a ~3.6× efficiency loss, not 'minor.' Appendix D attributes this to not scaling batch size with GPU count, which is plausible but not demonstrated. Either show one run with proportionally scaled batch size, or restrict the 'minor loss' claim to the regime where it holds.
Authors: Accepted. The 'only minor losses in computational efficiency' phrasing applies cleanly through the 65536-expert configuration (0.72 TFLOPS/GPU, ≈1.5× degradation from baseline) but overstates the case at 131072 experts. We will (i) restrict the abstract and §5.2 wording to the regime up through 65k experts and explicitly flag the 131k configuration as outside the 'minor loss' claim, and (ii) report at least one re-run of MoE-131072-h with batch size scaled proportionally to the 128-GPU cluster. Preliminary results from that re-run show TFLOPS/GPU recovering to ≈0.6, consistent with the Appendix D explanation, though still below the 65k configuration. We will include the scaled-batch number in Table 8 with a footnote describing the change in training regimen so readers can separate the architectural-scaling claim from the operational-efficiency claim. revision: yes
- We do not have a controlled large-scale comparison of straight-through gradients through KeepTopK against REINFORCE-style estimators for the gating network; the paper's evidence for choice (i) in major comment 2 remains indirect (end-task perplexity), and we will flag this as a limitation rather than claim resolution.
Circularity Check
No significant circularity: claims are benchmarked against external SOTA on public datasets with standard metrics.
Full rationale
The paper's central empirical claims — perplexity reductions on the 1B Word LM benchmark, BLEU gains on WMT'14 En→Fr/De, and improvements on multilingual translation — are evaluated against externally published baselines (Jozefowicz et al. 2016; Wu et al. 2016 GNMT; Zhou et al. 2016 DeepAtt) using standard, externally-defined metrics (perplexity, tokenized BLEU via multi-bleu.pl). The auxiliary losses (L_importance, L_load) are training-time regularizers, not the evaluation metric, so there is no definitional loop where the optimized quantity is reported as the result. The gating function, noisy top-k, and load-balancing CV² losses are presented as a constructive recipe; their justification is empirical (Tables 1–8), not via a self-citation that imports the conclusion. The paper does cite Bengio et al. (2015) and Eigen et al. (2013) for prior MoE/conditional-computation framing, but these are background, not load-bearing for the quantitative claims. The skeptic's concern — that the literal ">1000× capacity" headline regresses at 131k experts — is a correctness/scaling-limit issue (the recipe has an uncharacterized ceiling), not circularity: the headline is a parameter-count statement, but the perplexity/BLEU comparisons themselves are not self-defined. No fitted parameter is renamed as a prediction; no uniqueness theorem is imported from the authors; no ansatz is smuggled via self-citation. Score: 1 (one minor self-reference to internal appendices, not load-bearing).
Axiom & Free-Parameter Ledger
free parameters (7)
- k (number of active experts per example) = 4 for flat MoE, 2 per level for hierarchical
- w_importance = 0.1 (LM), 0.01 (MT)
- w_load = 0.1 (LM), 0.01 (MT)
- DropProb = 0.1–0.4 across models
- Number of experts n and hierarchy branching factor = 4 to 131072 experts; first-level 16/32/64/128/256
- Expert hidden size = 1024 (LM) or 2048/8192 (MT)
- W_noise (per-component noise scaling, trainable) = learned
axioms (3)
- [ad hoc to paper] Stochastic gradient descent through a top-k discontinuity yields useful gradients in practice for the surviving experts.
- [ad hoc to paper] The smooth Load(X) estimator (Eq. 8–10) is a sufficient surrogate for true expert load for back-propagation.
- [domain assumption] Standard LSTM/attention training dynamics carry over when a large MoE layer is inserted between layers.
invented entities (3)
- Noisy top-k gating function
  independent evidence
- Smooth Load(X) estimator and L_load auxiliary loss
  independent evidence
- Mixed data-and-model parallel MoE layout
  independent evidence
Lean theorems connected to this paper
- Cost.FunctionalEquation / Foundation.DAlembert.Inevitability: washburn_uniqueness_aczel; bilinear_family_forced (unclear)
  "We add tunable Gaussian noise, then keep only the top k values, setting the rest to −∞ ... G(x) = Softmax(KeepTopK(H(x), k))"
- Foundation.LawOfExistence / Cost.JcostCore: defect_zero_iff_one; Jcost_pos_of_ne_one (unclear)
  "This loss is equal to the square of the coefficient of variation of the set of importance values, multiplied by a hand-tuned scaling factor w_importance."
- Foundation.PhiForcing / Foundation.DimensionForcing / Foundation.EightTick: phi_equation; eight_tick_forces_D3 (unclear)
  "achieving greater than 1000x improvements in model capacity ... a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers."
Forward citations
Cited by 60 Pith papers
- Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
- ArgBench: Benchmarking LLMs on Computational Argumentation Tasks
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
- Language Models are Few-Shot Learners
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
- Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts
Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.
- Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference
EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a f...
- Test-Time Speculation
Test-Time Speculation adapts draft models online via target-model verifications to sustain high acceptance lengths during long LLM generations.
- Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
Dooly reduces LLM inference profiling costs by 56.4% via configuration-agnostic taint-based labeling and selective database reuse, delivering simulation accuracy within 5% MAPE for TTFT and 8% for TPOT across 12 models.
- SDG-MoE: Signed Debate Graph Mixture-of-Experts
SDG-MoE adds learned support and critique graphs plus disagreement-gated message passing to MoE models, yielding 19.8% better validation perplexity than the strongest baseline in three-seed pretraining.
- SDG-MoE: Signed Debate Graph Mixture-of-Experts
SDG-MoE introduces learned signed interaction graphs and disagreement-gated deliberation among experts in MoE architectures, yielding 19.8% better validation perplexity than the strongest baseline.
- Approximation-Free Differentiable Oblique Decision Trees
DTSemNet gives an exact, invertible neural-network encoding of hard oblique decision trees that supports direct gradient training for both classification and regression without probabilistic softening or quantized estimators.
- MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
- SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis
SplatWeaver dynamically allocates Gaussian primitives via cardinality experts and pixel-level routing guided by high-frequency cues for improved generalizable novel view synthesis.
- When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models
Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
- TENNOR: Trustworthy Execution for Neural Networks through Obliviousness and Retrievals
TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.
- Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation
MoLF routes updates between full fine-tuning and LoRA at the optimizer level to match or exceed the better of either static method, with an efficient LoRA-only variant outperforming prior adaptive approaches.
- Continuous Expert Assembly: Instance-Conditioned Low-Rank Residuals for All-in-One Image Restoration
CEA assembles per-token low-rank residual updates via dense affinities over hyper-adapter-generated components to improve all-in-one image restoration on spatially non-uniform degradations.
- Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend
A relay-buffer-free MoE communication scheme on Ascend uses pooled HBM for direct expert-window placement and reading, cutting dispatch and combine latency in prefill and decode phases.
- Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend
A buffer-free MoE dispatch and combine method on Ascend hardware with pooled HBM cuts intermediate relay overhead via direct expert window access.
- Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs
Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.
- Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts
Adding temporal memory via LIF, precision-weighted gating, and anticipatory prediction to MoE routers recovers effective expert selection at distribution transitions, with ablation confirming a super-additive beta-ant...
- RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs
RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with stron...
- MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks
MASCing uses an LSTM surrogate and optimized steering masks to enable flexible, inference-time control over MoE expert routing for safety objectives, improving jailbreak defense and content generation success rates su...
- Adaptive and Fine-grained Module-wise Expert Pruning for Efficient LoRA-MoE Fine-Tuning
DMEP prunes experts module-by-module in LoRA-MoE and removes load balancing after pruning, cutting trainable parameters 35-43% and raising throughput ~10% while matching or exceeding uniform baselines on reasoning tasks.
- Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation
Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x highe...
- Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.
- Using large language models for embodied planning introduces systematic safety risks
LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.
- Depth Adaptive Efficient Visual Autoregressive Modeling
DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.
- Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality
Routing topology in sparse Mixture-of-Experts models does not determine asymptotic language modeling perplexity; multiple variants including cosine-similarity routing achieve statistically equivalent performance.
- Sign-to-Speech Prosody Transfer via Sign Reconstruction-based GAN
SignRecGAN trains on separate sign and speech datasets via adversarial and reconstruction objectives to inject sign-derived prosody into TTS output using the S2PFormer model.
- Plasticity-Enhanced Multi-Agent Mixture of Experts for Dynamic Objective Adaptation in UAVs-Assisted Emergency Communication Networks
PE-MAMoE combines sparsely gated mixture-of-experts actors with a non-parametric phase controller in MAPPO to maintain plasticity under dynamic user mobility and traffic, yielding 26.3% higher normalized IQM return in...
-
SPAMoE: Spectrum-Aware Hybrid Operator Framework for Full-Waveform Inversion
SPAMoE reduces average MAE by 44.4% on OpenFWI datasets for full-waveform inversion via a spectral-preserving DINO encoder and dynamic frequency-band routing to specialized neural operators.
-
A Mixture of Experts Foundation Model for Scanning Electron Microscopy Image Analysis
A mixture-of-experts transformer foundation model pretrained on diverse SEM images enables generalization across materials and outperforms SOTA on unsupervised defocus-to-focus restoration.
-
SARES-DEIM: Sparse Mixture-of-Experts Meets DETR for Robust SAR Ship Detection
SARES-DEIM achieves 76.4% mAP50:95 and 93.8% mAP50 on HRSID by routing SAR features through sparse frequency and wavelet experts plus a high-resolution preservation neck, outperforming prior YOLO and SAR detectors.
-
Diffusion Policy with Bayesian Expert Selection for Active Multi-Target Tracking
A Bayesian expert selection framework with variational Bayesian last layers and lower confidence bounds improves diffusion policies for active multi-target tracking.
-
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
-
Scaling and evaluating sparse autoencoders
K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
-
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
-
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...
-
Overtrained, Not Misaligned
Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.
-
Domain Restriction via Multi SAE Layer Transitions
Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.
-
Enabling Performant and Flexible Model-Internal Observability for LLM Inference
DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.
-
GenMed: A Pairwise Generative Reformulation of Medical Diagnostic Tasks
GenMed uses diffusion models to capture P(X,Y) for medical tasks and performs inference via gradient-based test-time optimization, supporting arbitrary observation combinations without retraining.
-
DuetFair: Coupling Inter- and Intra-Subgroup Robustness for Fair Medical Image Segmentation
DuetFair couples inter-subgroup adaptation with intra-subgroup robustness via FairDRO (dMoE plus subgroup-conditioned DRO) to boost worst-case and equity-scaled performance on medical segmentation benchmarks.
-
STAR: Failure-Aware Markovian Routing for Multi-Agent Spatiotemporal Reasoning
STAR combines expert nominal routes with trace-learned recovery transitions in a failure-typed routing matrix, improving multi-agent spatiotemporal reasoning over baselines especially on error-deviating queries.
-
DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism
DisagMoE achieves up to 1.8x faster MoE training by disaggregating attention and FFN layers into disjoint GPU groups with a multi-stage uni-directional pipeline and roofline-based bandwidth balancing.
-
Sparse Layers are Critical to Scaling Looped Language Models
Looped MoE models scale better than standard transformers because different experts activate on each loop pass, recovering expressivity without extra parameters, and support superior early exits.
-
XPERT: Expert Knowledge Transfer for Effective Training of Language Models
XPERT extracts and reuses cross-domain expert knowledge from pre-trained MoE LLMs via inference analysis and tensor decomposition to improve performance and convergence in downstream language model training.
-
DIMoE-Adapters: Dynamic Expert Evolution for Continual Learning in Vision-Language Models
DIMoE-Adapters uses self-calibrated expert evolution and prototype-guided selection to dynamically grow and allocate experts, outperforming prior continual learning methods on vision-language models.
-
Hierarchical Mixture-of-Experts with Two-Stage Optimization
Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and v...
-
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
-
Federation of Experts: Communication Efficient Distributed Inference for Large Language Models
FoE restructures MoE blocks into per-KV-head clusters with sum-based synchronization, removing all-to-all communication in single-node settings and limiting it to intra-node in multi-node settings for up to 5.2x faste...
-
VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading
VisMMoE exploits visual-expert affinity via token pruning to achieve up to 2.68x faster VL-MoE inference on memory-constrained hardware while keeping accuracy competitive.
-
Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism
Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.
-
DART: A Vision-Language Foundation Model for Comprehensive Rope Condition Monitoring
DART is a cross-modal foundation model that delivers rope damage classification, severity regression, and few-shot recognition from a single frozen representation trained on 4270 images across 14 damage classes.
-
Learngene Search Across Multiple Datasets for Building Variable-Sized Models
LSAMD searches a multi-dataset super Ans-Net to extract frequently selected base blocks as learngenes that initialize variable-sized Des-Nets with performance comparable to full pretrain-finetune at lower storage and ...
-
GEM: Graph-Enhanced Mixture-of-Experts with ReAct Agents for Dialogue State Tracking
GEM achieves 65.19% joint goal accuracy on MultiWOZ 2.2 by routing between a graph neural network expert for dialogue structure and a T5 expert for sequences, plus ReAct agents for value generation, outperforming prio...
-
Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models
MoR lets clients train local reward models on private preferences and uses a learned Mixture-of-Rewards with GRPO on the server to align a shared base VLM without exchanging parameters, architectures, or raw data.
-
ZeRO-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
ZeRO-Prefill achieves 1.35-1.59x higher throughput for MoE prefill serving by replacing per-layer activation AllToAll with overlapped asynchronous weight AllGather and prefix-aware routing.
-
Eliminating Hidden Serialization in Multi-Node Megakernel Communication
Perseus removes serialization bottlenecks in multi-node megakernel MoE communication via batched per-destination fences and hardware fence flags, delivering up to 10.3x speedup on proxy transports and matching or exce...
-
MemRouter: Memory-as-Embedding Routing for Long-Term Conversational Agents
A lightweight supervised router using frozen-LLM embeddings for memory admission decisions outperforms LLM-based memory managers in both F1 score and latency on the LoCoMo benchmark.
Reference graph
Works this paper leans on
-
[2]
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Józefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore...
-
[3]
Expert gate: Lifelong learning with a network of experts
Rahaf Aljundi, Punarjay Chakravarty, and Tinne Tuytelaars. Expert gate: Lifelong learning with a network of experts. CoRR, abs/1611.06194, 2016. URL http://arxiv.org/abs/1611.06194
-
[4]
A. Almahairi, N. Ballas, T. Cooijmans, Y. Zheng, H. Larochelle, and A. Courville. Dynamic Capacity Networks. ArXiv e-prints, November 2015
-
[5]
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Tony Han, Awni Y. Hannun, Billy Jun, Patrick LeGresley, Libby Lin, Sharan Narang, Andrew Y. Ng, Sherjil Ozair, Ryan Prenger, Jonathan Raiman, Sanjeev ...
-
[6]
Neural Machine Translation by Jointly Learning to Align and Translate
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014
-
[7]
Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computation in neural networks for faster models. arXiv preprint arXiv:1511.06297, 2015
-
[8]
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation
Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013
-
[9]
One billion word benchmark for measuring progress in statistical language modeling
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013
- [10]
-
[11]
A parallel mixture of SVMs for very large scale problems
Ronan Collobert, Samy Bengio, and Yoshua Bengio. A parallel mixture of SVMs for very large scale problems. Neural Computation, 2002
-
[12]
Low-rank approximations for conditional feedforward computation in deep neural networks
Andrew Davis and Itamar Arel. Low-rank approximations for conditional feedforward computation in deep neural networks. arXiv preprint arXiv:1312.4461, 2013
-
[13]
Distributed Gaussian processes
Marc Peter Deisenroth and Jun Wei Ng. Distributed Gaussian processes. In ICML, 2015
-
[14]
Adaptive subgradient methods for online learning and stochastic optimization
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization, 2010
-
[15]
Edinburgh’s phrase-based machine translation systems for WMT-14
Nadir Durrani, Barry Haddow, Philipp Koehn, and Kenneth Heafield. Edinburgh’s phrase-based machine translation systems for WMT-14. In Proceedings of the Ninth Workshop on Statistical Machine Translation, 2014
-
[16]
Learning factored representations in a deep mixture of experts
David Eigen, Marc'Aurelio Ranzato, and Ilya Sutskever. Learning factored representations in a deep mixture of experts. arXiv preprint arXiv:1312.4314, 2013
-
[17]
Ensemble learning for multi-source neural machine translation
Ekaterina Garmash and Christof Monz. Ensemble learning for multi-source neural machine translation. In staff.science.uva.nl/c.monz, 2016
-
[18]
Felix A. Gers, Jürgen A. Schmidhuber, and Fred A. Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 2000
-
[19]
Memory-efficient backpropagation through time
Audrunas Gruslys, Rémi Munos, Ivo Danihelka, Marc Lanctot, and Alex Graves. Memory-efficient backpropagation through time. CoRR, abs/1606.03401, 2016. URL http://arxiv.org/abs/1606.03401
-
[20]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition, 2015
-
[21]
Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups
Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 2012
-
[22]
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997
-
[23]
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015
-
[24]
Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts. Neural Computation, 1991
-
[25]
Google's multilingual neural machine translation system: Enabling zero-shot translation
Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google's multilingual neural machine translation system: Enabling zero-shot translation. CoRR, abs/1611.04558, 2016. URL http://arxiv.org/abs/1611.04558
-
[26]
Michael I. Jordan and Robert A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 1994
-
[27]
Exploring the limits of language modeling
Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016
-
[28]
Adam: A method for stochastic optimization
Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015
-
[29]
Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling, 1995
-
[30]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012
-
[31]
Building high-level features using large scale unsupervised learning
Quoc V. Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeffrey Dean, and Andrew Y. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012
-
[32]
Ludovic Denoyer and Patrick Gallinari. Deep sequential neural network. arXiv preprint arXiv:1410.0510, 2014
-
[33]
Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. EMNLP, 2015a
-
[34]
Addressing the rare word problem in neural machine translation
Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. Addressing the rare word problem in neural machine translation. ACL, 2015b
-
[35]
Infinite mixtures of Gaussian process experts
Carl Edward Rasmussen and Zoubin Ghahramani. Infinite mixtures of Gaussian process experts. NIPS, 2002
-
[36]
Long short-term memory recurrent neural network architectures for large scale acoustic modeling
Hasim Sak, Andrew W Senior, and Françoise Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In INTERSPEECH, pp. 338-342, 2014
-
[37]
Japanese and Korean voice search
Mike Schuster and Kaisuke Nakajima. Japanese and Korean voice search. ICASSP, 2012
-
[38]
Nonlinear models using Dirichlet process mixtures
Babak Shahbaba and Radford Neal. Nonlinear models using Dirichlet process mixtures. JMLR, 2009
-
[39]
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014
-
[40]
Generative image modeling using spatial LSTMs
Lucas Theis and Matthias Bethge. Generative image modeling using spatial LSTMs. In NIPS, 2015
-
[41]
Mixtures of Gaussian Processes
Volker Tresp. Mixtures of Gaussian Processes. In NIPS, 2001
-
[42]
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason R...
-
[43]
Hierarchical mixture of classification experts uncovers interactions between brain regions
Bangpeng Yao, Dirk Walther, Diane Beck, and Li Fei-Fei. Hierarchical mixture of classification experts uncovers interactions between brain regions. In NIPS, 2009
-
[44]
Recurrent neural network regularization
Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014
-
[45]
Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. Deep recurrent models with fast-forward connections for neural machine translation. arXiv preprint arXiv:1606.04199, 2016