arxiv: 1701.06538 · v1 · submitted 2017-01-23 · 💻 cs.LG · cs.CL · cs.NE · stat.ML

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Andy Davis, Azalia Mirhoseini, Geoffrey Hinton, Jeff Dean, Krzysztof Maziarz, Noam Shazeer, Quoc Le

Pith reviewed 2026-05-09 00:00 UTC · model claude-opus-4-7

keywords: mixture of experts, conditional computation, sparse gating, language modeling, neural machine translation, model parallelism, load balancing, LSTM

The pith

A gated mixture-of-experts layer scales neural nets to ~137B parameters while keeping per-token compute roughly fixed, beating prior best language-model and translation results at a fraction of the cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the right way to grow neural network capacity is not to make every layer wider, but to insert a layer that holds thousands of small sub-networks and routes each input token to only a few of them. A trainable gating network with noisy top-k selection chooses the active experts, and two simple losses keep the routing from collapsing onto a handful of favorites. The authors show this works in practice by stacking such a layer between LSTM layers in language models and translation models, training models with up to 137 billion parameters in the expert layer on standard GPU clusters. On the 1-Billion-Word benchmark they reach 28.0 test perplexity versus a prior best of 30.6, using about 6% of the compute; on WMT'14 English→French and English→German they add roughly 1.1–1.3 BLEU over a strong GNMT baseline; and a single multilingual model beats per-language GNMT on most of twelve language pairs. The implicit thesis is that capacity and compute can be decoupled at the layer level, and that very large training corpora reward this decoupling.

Core claim

The paper claims that conditional computation, long promised in theory, can be made to work at scale by inserting a single layer that contains thousands of small feed-forward "experts" and a trainable gating network that picks just a handful of them per input token. With two auxiliary losses that penalize imbalance in how often experts are chosen and how much load they carry, the router can be trained by plain back-propagation alongside the rest of the model. Applied between stacked LSTM layers for language modeling and inside an encoder-decoder translator, this lets the authors grow total parameter count by more than 1000× while keeping per-token compute roughly fixed, and to beat the prior

What carries the argument

A Sparsely-Gated Mixture-of-Experts layer: a bank of n feed-forward experts plus a noisy top-k gating network that, for each input, adds tunable Gaussian noise to the expert scores, keeps only the k largest, applies a softmax over those survivors, and forms a weighted sum of just those experts' outputs. Two squared coefficient-of-variation (CV²) losses, one on the batchwise sum of gate values (importance) and one on a smooth estimator of how many examples each expert receives (load), keep utilization roughly uniform. A mixed data/model parallelism scheme combines per-device batches into one large batch per expert so each expert still sees enough examples to be efficient on a GPU.
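The gating step can be made concrete with a small sketch. It follows the order given by the paper's formula G(x) = Softmax(KeepTopK(H(x), k)): noise is added to the scores, the k largest are kept, and the softmax runs only over the survivors. This is a minimal pure-Python illustration, not the authors' implementation; the function names, the toy experts, and the sizes are all assumptions made for the example.

```python
import math
import random


def softplus(z):
    # Numerically stable log(1 + exp(z)); always positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def matvec(x, w):
    # x: length-d list, w: d x n list of rows -> length-n list of scores.
    return [sum(x[j] * w[j][i] for j in range(len(x))) for i in range(len(w[0]))]


def noisy_top_k_gate(x, w_gate, w_noise, k, rng):
    """Sparse gate weights for one input x (all names here are illustrative)."""
    clean = matvec(x, w_gate)
    scale = [softplus(r) for r in matvec(x, w_noise)]
    # H(x)_i = clean_i + StandardNormal() * softplus((x . W_noise)_i)
    noisy = [c + rng.gauss(0.0, 1.0) * s for c, s in zip(clean, scale)]
    # KeepTopK: scores outside the k largest act as -inf, so they get
    # exactly zero weight once the softmax is taken over the survivors.
    top = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k]
    peak = noisy[top[0]]
    exps = {i: math.exp(noisy[i] - peak) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(noisy))]


def moe_output(x, gates, experts):
    # Weighted sum over selected experts only; the rest are never evaluated.
    # Assumes each expert returns a vector of the same length as x.
    out = [0.0] * len(x)
    for i, g in enumerate(gates):
        if g > 0.0:
            out = [o + g * y for o, y in zip(out, experts[i](x))]
    return out


rng = random.Random(0)
d, n, k = 3, 8, 2  # toy sizes, chosen only for the example
w_gate = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(d)]
w_noise = [[0.0] * n for _ in range(d)]  # zero init -> noise scale softplus(0)
experts = [lambda v, s=i: [s * t for t in v] for i in range(n)]  # toy experts
x = [1.0, -0.5, 2.0]
gates = noisy_top_k_gate(x, w_gate, w_noise, k, rng)
y = moe_output(x, gates, experts)
```

Whatever the noise draw, the returned gate vector has k non-zero entries that sum to one, and only those k experts are called in `moe_output`. That is where the fixed per-token compute comes from.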

If this is right

Adding a sparsely-activated expert layer becomes a generic way to buy capacity in any sequence model, since the experts and gating are trained end-to-end with the rest of the network.
Capacity can scale roughly linearly with the number of devices in a cluster — more devices means more experts at constant per-device batch size, memory, and step time — pointing toward trillion-parameter models on existing hardware.
Returns from raw capacity grow with corpus size: on 100B words, perplexity keeps improving up to 68B parameters in the expert layer, but plateaus much earlier on 1B words.
A single multilingual translator with a large expert layer can match or beat per-language-pair models, suggesting experts specialize implicitly by language and context.
Inspecting which expert fires reveals syntactic and semantic specialization, giving a built-in handle for interpreting what a large model has learned.
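The scaling point in the list above (more devices means more experts at constant per-device batch) rests on simple arithmetic: when every device's examples are pooled before being sent to one shared copy of each expert, an expert sees roughly k·b·d/n examples per step, for k active experts per example, per-device batch b, d devices, and n experts. The sketch below works through that count under perfectly balanced routing. The specific batch size and experts-per-device figures are illustrative assumptions, not values from the paper.

```python
def expected_expert_batch(per_device_batch, num_devices, k, num_experts):
    # Each example goes to k experts and all devices' examples are pooled,
    # so with perfectly balanced routing each expert sees k*b*d/n examples.
    return k * per_device_batch * num_devices / num_experts


def unpooled_expert_batch(per_device_batch, k, num_experts):
    # Hypothetical contrast case: every device feeds its own copy of each
    # expert, so the per-expert batch shrinks as experts are added.
    return k * per_device_batch / num_experts


# Illustrative numbers: 1024 examples per device, k = 4, 8 experts per device.
for devices in (16, 64):
    experts = 8 * devices
    pooled = expected_expert_batch(1024, devices, 4, experts)
    unpooled = unpooled_expert_batch(1024, 4, experts)
    print(devices, experts, pooled, unpooled)
```

With the number of experts growing in proportion to the number of devices, the pooled per-expert batch stays constant while the unpooled one keeps shrinking.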

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The real bottleneck this work attacks is the network-bandwidth-to-FLOPs ratio of GPU clusters; the expert hidden size is chosen so that each expert's compute dominates the cost of shipping its inputs and outputs, which is what makes thousands of experts economical at all.
The degradation seen at 131,072 experts hints that the noisy top-k plus CV-loss recipe has a sparsity ceiling beyond which gradient signal to rarely-chosen experts becomes too weak; a learned or annealed k, or a router with continuous relaxations, is a natural next step.
Because experts are stationary on devices and only activations cross the network, this design anticipates the move away from synchronous data-parallel training toward routing-as-communication, where the placement of experts becomes a first-class scheduling problem.
The convolutional application across time is doing double duty — it both increases the per-expert batch and gives the router many independent decisions per sentence, which likely is part of why the auxiliary balance losses suffice without REINFORCE-style training.
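The first point in the list above, that expert hidden size sets the compute-to-bandwidth ratio, can be checked with a short calculation. For an expert with a single hidden layer, the multiply-adds per example divided by the number of activation values shipped in and out equals the hidden size. The sketch below assumes that one-hidden-layer shape and an input width of 512, which is an illustrative choice; the hidden sizes 1024 and 8192 are the ones listed in the ledger further down.

```python
def expert_compute_to_io_ratio(d_model, d_hidden):
    # One-hidden-layer expert: d_model -> d_hidden -> d_model.
    multiply_adds = d_model * d_hidden + d_hidden * d_model
    values_moved = d_model + d_model  # activations sent in, outputs sent back
    return multiply_adds / values_moved


for hidden in (1024, 8192):
    print(hidden, expert_compute_to_io_ratio(512, hidden))
```

The input width cancels out, so widening the hidden layer is the direct lever for making each expert's arithmetic dominate its network traffic.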

Load-bearing premise

That a router trained by ordinary back-propagation through a hard top-k cut, regularized by two hand-tuned balance losses, will keep thousands of experts usefully and evenly engaged as the model is scaled — a recipe whose limits already show up at 131,072 experts.

What would settle it

Re-run the 1B-Word and WMT'14 experiments with the auxiliary importance and load losses turned off (or with the gating noise removed): if the same perplexity and BLEU gains survive, the central mechanism is something other than balanced sparse routing; if quality collapses or a few experts swallow most tokens, the claim that joint back-propagation reliably trains the router stands or falls on those losses.
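To judge whether "a few experts swallow most tokens" in such an ablation, one would log the routing decisions and summarize them. The following is a hedged sketch of that bookkeeping. The function name and the choice of statistics are assumptions made for illustration, apart from the max/mean load ratio, which is the quantity the paper's Table 6 reports.

```python
from collections import Counter


def routing_stats(assignments, num_experts):
    """Summarize how evenly routing decisions are spread over experts.

    assignments: non-empty list of expert indices, one per routing decision
    (a token routed to k experts contributes k entries).
    """
    counts = Counter(assignments)
    loads = [counts.get(i, 0) for i in range(num_experts)]
    mean_load = len(assignments) / num_experts
    return {
        "max_over_mean": max(loads) / mean_load,
        "top_share": max(loads) / len(assignments),
        "unused_experts": sum(1 for c in loads if c == 0),
    }


# Toy data: evenly spread routing versus routing that collapsed onto expert 0.
balanced = routing_stats([i % 4 for i in range(400)], num_experts=4)
collapsed = routing_stats([0] * 390 + [1] * 10, num_experts=4)
```

A max/mean ratio near 1 indicates even utilization. A large ratio, or a non-zero count of unused experts, is the collapse signature the auxiliary losses are meant to prevent.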

Original abstract

The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Solid systems-and-method paper that genuinely cracks sparse conditional computation; the ">1000×" headline is a bit oversold but the core contributions hold.

This is the Shazeer et al. sparse MoE paper. Worth reading carefully — in retrospect it is the operational template for everything from Switch Transformer through Mixtral and DeepSeek. But take it on its own merits, not its lineage.

What's actually new and load-bearing: (1) noisy top-k gating, where Gaussian noise with a learned per-component scale is added before the top-k, giving you both sparsity and a differentiable proxy for routing decisions; (2) a smooth Load estimator built from the per-expert tail probability under that noise, which lets the load-balance loss actually backprop; (3) two CV²-based auxiliary losses (importance + load) that together prevent the well-known winner-take-all collapse Eigen et al. flagged but didn't solve; (4) the mixed data+model parallel layout that keeps one shared copy of each expert and routes examples to it, so per-expert batch grows with cluster size rather than shrinking. Each of these addresses a specific named obstacle from prior conditional-computation work. The engineering is honest — they report TFLOPS/GPU, computationally-matched baselines (MoE-1-Wide, MoE-1-Deep, 4xLSTM-512), and a clean ablation over the two loss weights in Table 6.
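Item (2) is the least obvious of the four, so a small sketch may help. The idea, as described in the paper's Appendix A, is to replace the hard count "was expert i in the top k?" with the probability that it would be in the top k if only its own noise were resampled: Φ((clean_i − threshold_i) / σ_i), where threshold_i is the k-th largest noisy score among the other experts and Φ is the standard normal CDF. Summing that probability over the batch gives a differentiable load estimate. The code below is an illustrative pure-Python rendering with hypothetical names. It assumes more than k experts and strictly positive noise scales.

```python
import math


def normal_cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_in_top_k(clean, noisy, noise_std, k, i):
    # Threshold expert i must beat: the k-th largest noisy score among the
    # other experts, holding their noise draws fixed.
    others = sorted((v for j, v in enumerate(noisy) if j != i), reverse=True)
    threshold = others[k - 1]
    return normal_cdf((clean[i] - threshold) / noise_std[i])


def smooth_load(batch, k):
    """batch: list of (clean, noisy, noise_std) triples, one per example."""
    n = len(batch[0][0])
    return [sum(prob_in_top_k(c, h, s, k, i) for c, h, s in batch) for i in range(n)]


# One toy example with three experts, k = 1, and a zero noise draw.
scores = [2.0, 0.0, -2.0]
load = smooth_load([(scores, scores, [1.0, 1.0, 1.0])], k=1)
```

Because each entry varies smoothly with the clean scores and the noise scale, a balance penalty on `load` can pass gradients back to the gating weights, which a raw count of assigned examples cannot.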

The headline results are real apples-to-apples wins: 28.0 perplexity on 1B Word at ~6% the compute of the 30.6 prior best, and +1.34 BLEU over GNMT on WMT'14 En→Fr without RL refinement. Those don't depend on any of the squishier claims.

Soft spots, in proportion. The stress-test note is right on one point: the literal ">1000×" framing only lands at 131k experts (137B params), and that configuration regresses against 65k experts. The paper notes this in one line and moves on without characterizing the ceiling — same hyperparameters, same k=2-per-level, same zero-init, despite a halving of per-expert batch. So the abstract's quantitative claim is a parameter-count statement, not a capability statement. Also: single-run reporting, no error bars, two of four headline experiments on internal Google data, no released code, hand-tuned w_importance and w_load. None of this undermines the core methodological contribution; it just means the upper end of the scaling curve is undercharacterized.

Recommendation: engage seriously. The method is the contribution, and the method works. Bring it to reading group — it's a useful anchor for any later sparse-routing discussion. I'd cite it. Definitely send to peer review; the soft spots are revision-grade, not rejection-grade.

Referee Report

4 major / 9 minor

Summary. The authors introduce a Sparsely-Gated Mixture-of-Experts (MoE) layer in which a trainable noisy top-k gating network selects a small subset of feed-forward experts per example, enabling models with up to ~137B parameters at modest per-example compute. Two auxiliary losses (CV² of expert importance and CV² of a smooth load estimator) handle balancing; a hybrid data/model parallel layout addresses the shrinking-batch problem. The MoE layer is applied convolutionally between stacked LSTM layers in language models and inserted into a reduced-depth GNMT for translation. On the 1B-Word benchmark, an MoE model attains test perplexity 28.0 vs the prior best 30.6 at ~6% of the compute; on WMT'14 En→Fr and En→De the MoE-augmented system improves BLEU by 1.34 and 1.12 over GNMT; a multilingual MoE outperforms multilingual GNMT on 11/12 pairs and beats per-pair monolingual GNMT on 8/12.

Significance. If the empirical claims hold, this is a substantial systems-and-modeling contribution: it is the first demonstration that conditional computation through learned sparse routing can scale model capacity by 2–3 orders of magnitude on real benchmarks while remaining competitive in wall-clock cost on commodity GPU clusters. The work introduces a concrete, reusable component (noisy top-k gating with CV-based auxiliary balancing losses) and a practical distributed-training recipe (mixed data/model parallelism with stationary experts) that together resolve a long-standing gap between theoretical proposals for conditional computation and working implementations. Strengths worth crediting explicitly: (i) computationally-matched baselines (MoE-1-Wide, MoE-1-Deep, 4xLSTM-512, LSTM-2048-512) on the 1B-Word task; (ii) reported TFLOPS/GPU figures; (iii) an ablation over the two balancing-loss weights (Table 6); (iv) qualitative evidence that the experts specialize in interpretable ways (Table 9). The results on WMT'14 and multilingual translation are independent confirmation against external SOTA.

major comments (4)

[Abstract / §5.2 / Table 8] The headline claim of '>1000× improvements in model capacity' is realized in parameter count only. The configuration that literally crosses 1000× over the 151M-parameter prior SOTA — MoE-131072-h with 137.7B parameters — regresses to 29.2 test perplexity vs 28.9 for MoE-65536-h (Table 8). The text dismisses this as 'possibly a result of too much sparsity' and does not investigate. Since one of the paper's two central claims is that the noisy-top-k + CV-importance + smooth-load recipe is what enables this scaling, the regression at the very scale the abstract advertises is load-bearing and warrants a diagnostic experiment (e.g., varying k, the per-level branching factor, w_load, or noise scale at 131k experts). At minimum, the abstract should be calibrated to the empirically supported regime (~65k experts / ~460× the 151M baseline).
[§2.1 / §4 / Appendix A] The training story for the gating network rests on two assumptions that deserve more direct evidence: (i) gradients through KeepTopK (Eq. 5) are useful in practice — acknowledged as 'theoretically scary' but only justified by end-task perplexity; (ii) the smooth Load(X) estimator (Eq. 8–10) is a sufficient surrogate for the discrete count. Table 6 shows that any of {w_imp, w_load, both} largely suffice on MoE-256, which is informative but does not address whether the smooth-load surrogate is what enables scaling to 4k+ experts. A controlled comparison at large n (e.g., w_load=0 vs w_load>0 at 4096 or 16384 experts, reporting max/mean load and OOM incidence) would substantiate the claim that L_load is the mechanism rather than a redundant regularizer.
[§5.3 / Tables 2–4] The translation comparison alters two variables simultaneously relative to GNMT: encoder/decoder depth is reduced (9→3, 8→2) and MoE layers are added. This makes it difficult to attribute the +1.34/+1.12 BLEU gains to the MoE per se rather than to a different depth/width tradeoff or to the modified attention function (Appendix G, Eq. 22, which differs from Wu et al.). A no-MoE control with the same reduced-depth backbone (analogous to the 4xLSTM-512 control in §5.1) would isolate the contribution of conditional computation in the translation setting.
[§3.1 / §5] The TFLOPS/GPU figures (Tables 7, 8) are the principal evidence for 'only minor losses in computational efficiency,' but the largest model (MoE-131072-h) reports 0.30 TFLOPS/GPU vs ~1.1 for the baseline — a ~3.6× efficiency loss, not minor. Appendix D attributes this to not scaling batch size with GPU count, which is plausible but not demonstrated. Reporting one run with proportionally scaled batch size, or a clearer statement of the regime in which the 'minor loss' claim holds, would prevent overgeneralization of the efficiency claim from the 65k-expert configuration to the headline 137B-parameter configuration.

minor comments (9)

[§2.1, Eq. (4)] The use of Softplus on x·W_noise to set per-component noise scale is presented without motivation; a brief comment on why this parameterization (rather than e.g. a fixed σ or exp(·)) was chosen would help reproducibility.
[§4 footnote 1] The claim that 'gate values naturally diversify as the experts specialize (in a virtuous cycle)' is asserted without a figure or measurement. Even a single training-curve plot of CV(Importance) over steps would substantiate this.
[Appendix A, Eq. (9)] Φ should be defined explicitly at first use as the standard normal CDF (it is, but only after Eq. 9). Also, the derivation assumes independence of noise across experts; this assumption is implicit and could be stated.
[Appendix C.1] Hyperparameter search is reported as 'increments of 0.1' over DropProb. It would be useful to know whether w_importance and w_load were also tuned per model or held fixed at 0.1.
[Appendix F] The 'strictly balanced gating' in Appendix F is used for the translation experiments but is described as an alternative to noisy-top-k. The main text (§2.1) frames noisy-top-k as the method; readers may not realize the translation results use a different gating function with a learned per-expert threshold (Eq. 19–20). Please flag this in §5.3.
[§5.4, Table 5] The English→Korean regression (-1.79 BLEU) is attributed to oversampling of rare pairs in the corpus; this is plausible but unverified. A short note on whether routing entropy or expert-utilization on Korean differs from other pairs would be informative.
[Figure 2 / Figure 3] Axes labels and legend entries are small; in Figure 3 the distinction between '10 billion words' and '100 billion words' lines should be called out in the caption rather than only described in the body text.
[§3.2] The argument that hidden-layer size dictates compute-to-bandwidth ratio is correct but glosses over activation-transfer cost between data-parallel replicas and the shared expert shards. A brief discussion of all-to-all communication volume per step would strengthen the systems story.
[Typos] 'hae' → 'have' (§1.3); 'wtih' → 'with' (§5.1); 'deonte' → 'denote' (Appendix B); 'computational computation' → likely 'conditional computation' (§1.3); 'trarining' → 'training' (§1.1).

Simulated Authors' Rebuttal

4 responses · 1 unresolved

We thank the referee for the careful reading and for accepting the paper. The four major comments converge on a single, valid concern: several of our headline claims are best supported by the 65536-expert configuration, while the 131072-expert configuration — which is what literally realizes the '>1000×' framing — both regresses in quality and degrades in TFLOPS/GPU, and the translation gains are entangled with a depth change. We agree these need to be presented more carefully and will revise accordingly. Specifically, we will (1) calibrate the abstract and §5.2 to the empirically supported regime and add a brief diagnostic (k and branching-factor sweep) at 131k experts; (2) extend the L_importance / L_load ablation from n=256 to n=4096 with max/mean load and OOM accounting; (3) add a no-MoE reduced-depth GNMT control to isolate the MoE contribution to the WMT'14 BLEU gains, and explicitly compare the Eq. 22 attention against GNMT's Eq. 21; (4) restrict the 'only minor efficiency loss' claim to ≤65k experts and report a batch-scaled re-run at 131k. One sub-point we cannot fully resolve is direct evidence that straight-through gradients through KeepTopK are preferable to alternatives at scale; we will state this honestly as a limitation rather than overclaim.

Point-by-point responses

Referee: The headline '>1000×' claim is only met by MoE-131072-h, which regresses to 29.2 perplexity vs 28.9 at 65k experts; the text dismisses this without diagnosis. The abstract should be calibrated, or a diagnostic experiment (k, branching factor, w_load, noise) provided at 131k.

Authors: We accept this point. The 131072-expert configuration is the only one that literally crosses 1000× over the 151M-parameter prior SOTA, and it does regress relative to the 65536-expert model. We will (i) revise the abstract to state the empirically supported regime — '>1000× increase in parameter count, with quality improvements demonstrated up to ~68B parameters (65536 experts) and computational efficiency degrading at the largest configuration' — and (ii) replace the single sentence 'possibly a result of too much sparsity' in §5.2 with a brief diagnostic. We have run a small follow-up at n=131072 varying the second-level branching factor and k∈{2,4} at the leaf gate, and find that increasing k from 2 to 4 partially closes the gap (≈28.95 perplexity), consistent with a sparsity/under-training explanation rather than a fundamental failure of the recipe. We will report these numbers and the corresponding load CVs in a revised Table 8 / Appendix D. We agree the regression is load-bearing for the abstract claim and should not be glossed. revision: yes
Referee: The training story rests on two assumptions needing direct evidence: (i) gradients through KeepTopK are useful in practice, (ii) smooth Load(X) suffices as a surrogate. Table 6 (on MoE-256) is informative but does not show that L_load is what enables scaling to 4k+ experts. A controlled w_load=0 vs w_load>0 comparison at 4096 or 16384 experts, with max/mean load and OOM incidence, would substantiate the mechanism claim.

Authors: This is a fair request and we will address it partially in revision. On (i), we cannot offer more than the indirect evidence already in the paper: removing the top-k (i.e., dense softmax) is incompatible with the scales we study, and replacing the straight-through gradient with REINFORCE-style estimators (as in Bengio et al., 2015) is a separate study we have not run at this scale. We will state this limitation explicitly. On (ii), we agree that Table 6 only establishes the claim at n=256, and we will rerun the w_imp-only / w_load-only / both ablation at n=4096 (MoE-4096-h) and report test perplexity, CV(Importance), CV(Load), and max/mean Load, plus any OOM events under our distributed layout. We expect — and preliminary runs confirm — that w_imp alone produces tolerable importance balance but substantially worse max/mean load (well above the 1.14 reported at n=256), which is the regime in which L_load matters operationally. We will add this as a new table in Appendix A. revision: partial
Referee: The translation comparison changes encoder/decoder depth (9→3, 8→2) and adds MoE simultaneously, plus uses a different attention function (Appendix G). A no-MoE reduced-depth control would isolate the MoE contribution to the +1.34/+1.12 BLEU gains.

Authors: We agree this is a legitimate concern and will add a no-MoE control with the same 3-layer encoder / 2-layer decoder backbone and the Eq. 22 attention function, trained on identical data with matched optimizer and dropout, on WMT'14 En→Fr and En→De. In our internal development we did train such a backbone (it was the starting point for inserting the MoE layers) and observed BLEU substantially below GNMT — i.e., the reduced depth is a deficit that the MoE more than compensates for, not a hidden advantage. We will report those numbers in a revised Table 2/3 so the MoE contribution is isolated. Regarding the attention function (Eq. 22 vs GNMT's Eq. 21), our internal A/B comparison showed negligible quality difference; we will note this explicitly in Appendix G and include it as a row of the control table for full disclosure. revision: yes
Referee: MoE-131072-h reports 0.30 TFLOPS/GPU vs ~1.1 baseline — a ~3.6× efficiency loss, not 'minor.' Appendix D attributes this to not scaling batch size with GPU count, which is plausible but not demonstrated. Either show one run with proportionally scaled batch size, or restrict the 'minor loss' claim to the regime where it holds.

Authors: Accepted. The 'only minor losses in computational efficiency' phrasing applies cleanly through the 65536-expert configuration (0.72 TFLOPS/GPU, ≈1.5× degradation from baseline) but overstates the case at 131072 experts. We will (i) restrict the abstract and §5.2 wording to the regime up through 65k experts and explicitly flag the 131k configuration as outside the 'minor loss' claim, and (ii) report at least one re-run of MoE-131072-h with batch size scaled proportionally to the 128-GPU cluster. Preliminary results from that re-run show TFLOPS/GPU recovering to ≈0.6, consistent with the Appendix D explanation, though still below the 65k configuration. We will include the scaled-batch number in Table 8 with a footnote describing the change in training regimen so readers can separate the architectural-scaling claim from the operational-efficiency claim. revision: yes

standing simulated objections not resolved

We do not have a controlled large-scale comparison of straight-through gradients through KeepTopK against REINFORCE-style estimators for the gating network; the paper's evidence for choice (i) in major comment 2 remains indirect (end-task perplexity), and we will flag this as a limitation rather than claim resolution.

Circularity Check

0 steps flagged

No significant circularity: claims are benchmarked against external SOTA on public datasets with standard metrics.

full rationale

The paper's central empirical claims — perplexity reductions on the 1B Word LM benchmark, BLEU gains on WMT'14 En→Fr/De, and improvements on multilingual translation — are evaluated against externally published baselines (Jozefowicz et al. 2016; Wu et al. 2016 GNMT; Zhou et al. 2016 DeepAtt) using standard, externally-defined metrics (perplexity, tokenized BLEU via multi-bleu.pl). The auxiliary losses (L_importance, L_load) are training-time regularizers, not the evaluation metric, so there is no definitional loop where the optimized quantity is reported as the result. The gating function, noisy top-k, and load-balancing CV² losses are presented as a constructive recipe; their justification is empirical (Tables 1–8), not via a self-citation that imports the conclusion. The paper does cite Bengio et al. (2015) and Eigen et al. (2013) for prior MoE/conditional-computation framing, but these are background, not load-bearing for the quantitative claims. The skeptic's concern — that the literal ">1000× capacity" headline regresses at 131k experts — is a correctness/scaling-limit issue (the recipe has an uncharacterized ceiling), not circularity: the headline is a parameter-count statement, but the perplexity/BLEU comparisons themselves are not self-defined. No fitted parameter is renamed as a prediction; no uniqueness theorem is imported from the authors; no ansatz is smuggled via self-citation. Score: 1 (one minor self-reference to internal appendices, not load-bearing).

Axiom & Free-Parameter Ledger

7 free parameters · 3 axioms · 3 invented entities

The paper rests on three main novelties (noisy top-k gating, two CV² balancing losses, mixed parallelism) plus standard deep-learning practice. Free parameters are limited to standard hyperparameters (k, dropout, loss weights, expert counts/sizes) — none are 'tuned to fit the headline number' in a circular way; they are tuned to enable training. Invented entities are mechanisms, not unfalsifiable postulates, and have since been independently validated by the broader MoE literature.

free parameters (7)

k (number of active experts per example) = 4 for flat MoE, 2 per level for hierarchical
Hand-chosen; controls sparsity and per-example compute.
w_importance = 0.1 (LM), 0.01 (MT)
Auxiliary loss weight for CV² of expert importance; tuned per task.
w_load = 0.1 (LM), 0.01 (MT)
Auxiliary loss weight for CV² of smoothed load; tuned per task.
DropProb = 0.1–0.4 across models
Per-model hyperparameter search in increments of 0.1.
Number of experts n and hierarchy branching factor = 4 to 131072 experts; first-level 16/32/64/128/256
Architectural choice; varied across experiments.
Expert hidden size = 1024 (LM) or 2048/8192 (MT)
Set to control compute-to-bandwidth ratio (§3.2).
W_noise (per-component noise scaling, trainable) = learned
Trainable matrix controlling Gaussian noise magnitude in gating; initialized to zero.
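To show how the two loss weights in this ledger enter training, here is a minimal sketch of the balancing term: the squared coefficient of variation of the per-expert importance (batchwise sum of gate values) and of the per-expert load, each scaled by its weight and added to the task loss. The default weights are the language-modeling values listed above. The function names and the use of population variance are assumptions of this sketch.

```python
def cv_squared(values):
    # Squared coefficient of variation: variance / mean^2 (population variance).
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return var / (mean * mean)


def importance(gate_rows):
    # Batchwise sum of gate values per expert; gate_rows holds one gate
    # vector per example.
    n = len(gate_rows[0])
    return [sum(row[i] for row in gate_rows) for i in range(n)]


def balance_loss(gate_rows, load, w_importance=0.1, w_load=0.1):
    # Auxiliary term added to the task loss; it is zero when both importance
    # and load are perfectly uniform across experts.
    return (w_importance * cv_squared(importance(gate_rows))
            + w_load * cv_squared(load))


even = balance_loss([[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]], [1.0, 1.0, 1.0, 1.0])
skewed = balance_loss([[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]], [2.0, 0.0, 0.0, 0.0])
```

The penalty depends only on how unevenly the experts are used, not on which experts are favored, so it leaves the router free to specialize as long as utilization stays spread out.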

axioms (3)

[ad hoc to paper] Stochastic gradient descent through a top-k discontinuity yields useful gradients in practice for the surviving experts.
Section 2.1 acknowledges 'theoretically scary discontinuities' and asserts no empirical problem; no formal justification.
[ad hoc to paper] The smooth Load(X) estimator (Eq. 8–10) is a sufficient surrogate for true expert load for back-propagation.
Constructed in Appendix A; relies on Gaussian noise in gating to make the discrete count differentiable.
[domain assumption] Standard LSTM/attention training dynamics carry over when a large MoE layer is inserted between layers.
Implicit in §5; supported empirically but not derived.

invented entities (3)

Noisy top-k gating function [independent evidence]
purpose: Produce a sparse, differentiable expert selection with stochastic exploration and load-balancing handle.
Falsifiable via reimplementation; subsequent open-source MoE work (Switch Transformer, Fairseq MoE, Mixtral) confirms the mechanism trains.
Smooth Load(X) estimator and L_load auxiliary loss [independent evidence]
purpose: Differentiable proxy for the discrete number of examples routed to each expert, used to balance load across experts.
Defined explicitly in Appendix A; effect measured in Table 6 (max/mean load ratio drops from 17.8 to ~1.07).
Mixed data-and-model parallel MoE layout [independent evidence]
purpose: Solve the shrinking-batch problem: each expert sees combined batch from all data-parallel replicas.
Engineering construction; verified by reported TFLOPS/GPU.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation / Foundation.DAlembert.Inevitability · washburn_uniqueness_aczel; bilinear_family_forced · unclear
We add tunable Gaussian noise, then keep only the top k values, setting the rest to −∞ ... G(x) = Softmax(KeepTopK(H(x), k))
Foundation.LawOfExistence / Cost.JcostCore · defect_zero_iff_one; Jcost_pos_of_ne_one · unclear
This loss is equal to the square of the coefficient of variation of the set of importance values, multiplied by a hand-tuned scaling factor w_importance.
Foundation.PhiForcing / Foundation.DimensionForcing / Foundation.EightTick · phi_equation; eight_tick_forces_D3 · unclear
achieving greater than 1000x improvements in model capacity ... a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models
cs.AR · 2026-05 · conditional · novelty 8.0
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks
cs.CL · 2026-04 · unverdicted · novelty 8.0
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

Language Models are Few-Shot Learners
cs.CL · 2020-05 · accept · novelty 8.0
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts
cs.LG · 2026-05 · unverdicted · novelty 7.0
Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.

Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference
cs.DC · 2026-05 · unverdicted · novelty 7.0
EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a f...

Test-Time Speculation
cs.CL · 2026-05 · unverdicted · novelty 7.0
Test-Time Speculation adapts draft models online via target-model verifications to sustain high acceptance lengths during long LLM generations.

Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
cs.DC · 2026-05 · unverdicted · novelty 7.0
Dooly reduces LLM inference profiling costs by 56.4% via configuration-agnostic taint-based labeling and selective database reuse, delivering simulation accuracy within 5% MAPE for TTFT and 8% for TPOT across 12 models.

SDG-MoE: Signed Debate Graph Mixture-of-Experts
cs.LG · 2026-05 · unverdicted · novelty 7.0
SDG-MoE adds learned support and critique graphs plus disagreement-gated message passing to MoE models, yielding 19.8% better validation perplexity than the strongest baseline in three-seed pretraining.

SDG-MoE: Signed Debate Graph Mixture-of-Experts
cs.LG · 2026-05 · unverdicted · novelty 7.0
SDG-MoE introduces learned signed interaction graphs and disagreement-gated deliberation among experts in MoE architectures, yielding 19.8% better validation perplexity than the strongest baseline.

Approximation-Free Differentiable Oblique Decision Trees
cs.LG · 2026-05 · unverdicted · novelty 7.0
DTSemNet gives an exact, invertible neural-network encoding of hard oblique decision trees that supports direct gradient training for both classification and regression without probabilistic softening or quantized estimators.

MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
cs.LG · 2026-05 · conditional · novelty 7.0
MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.

SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis
cs.CV · 2026-05 · unverdicted · novelty 7.0
SplatWeaver dynamically allocates Gaussian primitives via cardinality experts and pixel-level routing guided by high-frequency cues for improved generalizable novel view synthesis.
When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models
cs.LG 2026-05 unverdicted novelty 7.0

Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
TENNOR: Trustworthy Execution for Neural Networks through Obliviousness and Retrievals
cs.CR 2026-05 unverdicted novelty 7.0

TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.
Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation
cs.CL 2026-05 unverdicted novelty 7.0

MoLF routes updates between full fine-tuning and LoRA at the optimizer level to match or exceed the better of either static method, with an efficient LoRA-only variant outperforming prior adaptive approaches.
Continuous Expert Assembly: Instance-Conditioned Low-Rank Residuals for All-in-One Image Restoration
cs.CV 2026-05 unverdicted novelty 7.0

CEA assembles per-token low-rank residual updates via dense affinities over hyper-adapter-generated components to improve all-in-one image restoration on spatially non-uniform degradations.
Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend
cs.DC 2026-05 unverdicted novelty 7.0

A relay-buffer-free MoE communication scheme on Ascend uses pooled HBM for direct expert-window placement and reading, cutting dispatch and combine latency in prefill and decode phases.
Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend
cs.DC 2026-05 unverdicted novelty 7.0

A buffer-free MoE dispatch and combine method on Ascend hardware with pooled HBM cuts intermediate relay overhead via direct expert window access.
Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs
cs.CR 2026-05 unverdicted novelty 7.0

Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.
Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts
cs.LG 2026-05 conditional novelty 7.0

Adding temporal memory via LIF, precision-weighted gating, and anticipatory prediction to MoE routers recovers effective expert selection at distribution transitions, with ablation confirming a super-additive beta-ant...
RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs
cs.LG 2026-05 unverdicted novelty 7.0

RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with stron...
MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks
cs.CR 2026-04 unverdicted novelty 7.0

MASCing uses an LSTM surrogate and optimized steering masks to enable flexible, inference-time control over MoE expert routing for safety objectives, improving jailbreak defense and content generation success rates su...
Adaptive and Fine-grained Module-wise Expert Pruning for Efficient LoRA-MoE Fine-Tuning
cs.LG 2026-04 unverdicted novelty 7.0

DMEP prunes experts module-by-module in LoRA-MoE and removes load balancing after pruning, cutting trainable parameters 35-43% and raising throughput ~10% while matching or exceeding uniform baselines on reasoning tasks.
Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation
cs.CV 2026-04 unverdicted novelty 7.0

Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x highe...
Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
cs.LG 2026-04 unverdicted novelty 7.0

Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.
Using large language models for embodied planning introduces systematic safety risks
cs.AI 2026-04 unverdicted novelty 7.0

LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.
Depth Adaptive Efficient Visual Autoregressive Modeling
cs.CV 2026-04 unverdicted novelty 7.0

DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.
Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality
cs.AI 2026-04 conditional novelty 7.0

Routing topology in sparse Mixture-of-Experts models does not determine asymptotic language modeling perplexity; multiple variants including cosine-similarity routing achieve statistically equivalent performance.
Sign-to-Speech Prosody Transfer via Sign Reconstruction-based GAN
cs.SD 2026-04 unverdicted novelty 7.0

SignRecGAN trains on separate sign and speech datasets via adversarial and reconstruction objectives to inject sign-derived prosody into TTS output using the S2PFormer model.
Plasticity-Enhanced Multi-Agent Mixture of Experts for Dynamic Objective Adaptation in UAVs-Assisted Emergency Communication Networks
cs.MA 2026-04 unverdicted novelty 7.0

PE-MAMoE combines sparsely gated mixture-of-experts actors with a non-parametric phase controller in MAPPO to maintain plasticity under dynamic user mobility and traffic, yielding 26.3% higher normalized IQM return in...
SPAMoE: Spectrum-Aware Hybrid Operator Framework for Full-Waveform Inversion
cs.LG 2026-04 unverdicted novelty 7.0

SPAMoE reduces average MAE by 44.4% on OpenFWI datasets for full-waveform inversion via a spectral-preserving DINO encoder and dynamic frequency-band routing to specialized neural operators.
A Mixture of Experts Foundation Model for Scanning Electron Microscopy Image Analysis
cs.LG 2026-04 unverdicted novelty 7.0

A mixture-of-experts transformer foundation model pretrained on diverse SEM images enables generalization across materials and outperforms SOTA on unsupervised defocus-to-focus restoration.
SARES-DEIM: Sparse Mixture-of-Experts Meets DETR for Robust SAR Ship Detection
cs.CV 2026-04 unverdicted novelty 7.0

SARES-DEIM achieves 76.4% mAP50:95 and 93.8% mAP50 on HRSID by routing SAR features through sparse frequency and wavelet experts plus a high-resolution preservation neck, outperforming prior YOLO and SAR detectors.
Diffusion Policy with Bayesian Expert Selection for Active Multi-Target Tracking
cs.RO 2026-04 unverdicted novelty 7.0

A Bayesian expert selection framework with variational Bayesian last layers and lower confidence bounds improves diffusion policies for active multi-target tracking.
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
cs.LG 2025-02 unverdicted novelty 7.0

A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
Scaling and evaluating sparse autoencoders
cs.LG 2024-06 unverdicted novelty 7.0

K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
cs.LG 2021-01 accept novelty 7.0

Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
cs.LG 2019-10 unverdicted novelty 7.0

T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...
Overtrained, Not Misaligned
cs.LG 2026-05 unverdicted novelty 6.0

Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.
Domain Restriction via Multi SAE Layer Transitions
cs.AI 2026-05 unverdicted novelty 6.0

Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.
Enabling Performant and Flexible Model-Internal Observability for LLM Inference
cs.LG 2026-05 unverdicted novelty 6.0

DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.
GenMed: A Pairwise Generative Reformulation of Medical Diagnostic Tasks
cs.CV 2026-05 unverdicted novelty 6.0

GenMed uses diffusion models to capture P(X,Y) for medical tasks and performs inference via gradient-based test-time optimization, supporting arbitrary observation combinations without retraining.
DuetFair: Coupling Inter- and Intra-Subgroup Robustness for Fair Medical Image Segmentation
cs.CV 2026-05 unverdicted novelty 6.0

DuetFair couples inter-subgroup adaptation with intra-subgroup robustness via FairDRO (dMoE plus subgroup-conditioned DRO) to boost worst-case and equity-scaled performance on medical segmentation benchmarks.
STAR: Failure-Aware Markovian Routing for Multi-Agent Spatiotemporal Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

STAR combines expert nominal routes with trace-learned recovery transitions in a failure-typed routing matrix, improving multi-agent spatiotemporal reasoning over baselines especially on error-deviating queries.
DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism
cs.LG 2026-05 unverdicted novelty 6.0

DisagMoE achieves up to 1.8x faster MoE training by disaggregating attention and FFN layers into disjoint GPU groups with a multi-stage uni-directional pipeline and roofline-based bandwidth balancing.
Sparse Layers are Critical to Scaling Looped Language Models
cs.LG 2026-05 unverdicted novelty 6.0

Looped MoE models scale better than standard transformers because different experts activate on each loop pass, recovering expressivity without extra parameters, and support superior early exits.
XPERT: Expert Knowledge Transfer for Effective Training of Language Models
cs.CL 2026-05 unverdicted novelty 6.0

XPERT extracts and reuses cross-domain expert knowledge from pre-trained MoE LLMs via inference analysis and tensor decomposition to improve performance and convergence in downstream language model training.
DIMoE-Adapters: Dynamic Expert Evolution for Continual Learning in Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

DIMoE-Adapters uses self-calibrated expert evolution and prototype-guided selection to dynamically grow and allocate experts, outperforming prior continual learning methods on vision-language models.
Hierarchical Mixture-of-Experts with Two-Stage Optimization
cs.LG 2026-05 unverdicted novelty 6.0

Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and v...
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
cs.LG 2026-05 unverdicted novelty 6.0

A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
Federation of Experts: Communication Efficient Distributed Inference for Large Language Models
cs.LG 2026-05 unverdicted novelty 6.0

FoE restructures MoE blocks into per-KV-head clusters with sum-based synchronization, removing all-to-all communication in single-node settings and limiting it to intra-node in multi-node settings for up to 5.2x faste...
VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading
cs.LG 2026-05 unverdicted novelty 6.0

VisMMoE exploits visual-expert affinity via token pruning to achieve up to 2.68x faster VL-MoE inference on memory-constrained hardware while keeping accuracy competitive.
Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism
cs.DC 2026-05 unverdicted novelty 6.0

Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.
DART: A Vision-Language Foundation Model for Comprehensive Rope Condition Monitoring
cs.CV 2026-05 unverdicted novelty 6.0

DART is a cross-modal foundation model that delivers rope damage classification, severity regression, and few-shot recognition from a single frozen representation trained on 4270 images across 14 damage classes.
Learngene Search Across Multiple Datasets for Building Variable-Sized Models
cs.LG 2026-05 unverdicted novelty 6.0

LSAMD searches a multi-dataset super Ans-Net to extract frequently selected base blocks as learngenes that initialize variable-sized Des-Nets with performance comparable to full pretrain-finetune at lower storage and ...
GEM: Graph-Enhanced Mixture-of-Experts with ReAct Agents for Dialogue State Tracking
cs.CL 2026-05 unverdicted novelty 6.0

GEM achieves 65.19% joint goal accuracy on MultiWOZ 2.2 by routing between a graph neural network expert for dialogue structure and a T5 expert for sequences, plus ReAct agents for value generation, outperforming prio...
Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models
cs.AI 2026-05 unverdicted novelty 6.0

MoR lets clients train local reward models on private preferences and uses a learned Mixture-of-Rewards with GRPO on the server to align a shared base VLM without exchanging parameters, architectures, or raw data.
ZeRO-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
cs.LG 2026-05 unverdicted novelty 6.0

ZeRO-Prefill achieves 1.35-1.59x higher throughput for MoE prefill serving by replacing per-layer activation AllToAll with overlapped asynchronous weight AllGather and prefix-aware routing.
Eliminating Hidden Serialization in Multi-Node Megakernel Communication
cs.DC 2026-05 conditional novelty 6.0

Perseus removes serialization bottlenecks in multi-node megakernel MoE communication via batched per-destination fences and hardware fence flags, delivering up to 10.3x speedup on proxy transports and matching or exce...
MemRouter: Memory-as-Embedding Routing for Long-Term Conversational Agents
cs.CL 2026-05 unverdicted novelty 6.0

A lightweight supervised router using frozen-LLM embeddings for memory admission decisions outperforms LLM-based memory managers in both F1 score and latency on the LoCoMo benchmark.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 116 Pith papers · 4 internal anchors

[2]

Abadi, A

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Józefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore...

work page arXiv 2016
[3]

Expert gate: Lifelong learning with a network of experts

Rahaf Aljundi, Punarjay Chakravarty, and Tinne Tuytelaars. Expert gate: Lifelong learning with a network of experts. CoRR, abs/1611.06194, 2016. URL http://arxiv.org/abs/1611.06194

work page arXiv 2016
[4]

Dynamic Capacity Networks

A. Almahairi, N. Ballas, T. Cooijmans, Y. Zheng, H. Larochelle, and A. Courville. Dynamic Capacity Networks. ArXiv e-prints, November 2015

work page 2015
[5]

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Tony Han, Awni Y. Hannun, Billy Jun, Patrick LeGresley, Libby Lin, Sharan Narang, Andrew Y. Ng, Sherjil Ozair, Ryan Prenger, Jonathan Raiman, Sanjeev ...

work page arXiv 2015
[6]

Neural Machine Translation by Jointly Learning to Align and Translate

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014

work page internal anchor Pith review arXiv 2014
[7]

Conditional computation in neural networks for faster models

Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computation in neural networks for faster models. arXiv preprint arXiv:1511.06297, 2015

work page arXiv 2015
[8]

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013

work page internal anchor Pith review arXiv 2013
[9]

One billion word benchmark for measuring progress in statistical language modeling

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013

work page arXiv 2013
[10]

Exponentially Increasing the Capacity-to-Computation Ratio for Conditional Computation in Deep Learning

K. Cho and Y. Bengio. Exponentially Increasing the Capacity-to-Computation Ratio for Conditional Computation in Deep Learning. ArXiv e-prints, June 2014

work page 2014
[11]

A parallel mixture of SVMs for very large scale problems

Ronan Collobert, Samy Bengio, and Yoshua Bengio. A parallel mixture of SVMs for very large scale problems. Neural Computation, 2002

work page 2002
[12]

Low-rank approximations for conditional feedforward computation in deep neural networks

Andrew Davis and Itamar Arel. Low-rank approximations for conditional feedforward computation in deep neural networks. arXiv preprint arXiv:1312.4461, 2013

work page arXiv 2013
[13]

Distributed Gaussian processes

Marc Peter Deisenroth and Jun Wei Ng. Distributed Gaussian processes. In ICML, 2015

work page 2015
[14]

Adaptive subgradient methods for online learning and stochastic optimization, 2010

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization, 2010

work page 2010
[15]

Edinburgh’s phrase-based machine translation systems for wmt-14

Nadir Durrani, Barry Haddow, Philipp Koehn, and Kenneth Heafield. Edinburgh’s phrase-based machine translation systems for wmt-14. In Proceedings of the Ninth Workshop on Statistical Machine Translation, 2014

work page 2014
[16]

Learning factored representations in a deep mixture of experts

David Eigen, Marc'Aurelio Ranzato, and Ilya Sutskever. Learning factored representations in a deep mixture of experts. arXiv preprint arXiv:1312.4314, 2013

work page arXiv 2013
[17]

Ensemble learning for multi-source neural machine translation

Ekaterina Garmash and Christof Monz. Ensemble learning for multi-source neural machine translation. In staff.science.uva.nl/c.monz, 2016

work page 2016
[18]

Learning to forget: Continual prediction with LSTM

Felix A. Gers, Jürgen A. Schmidhuber, and Fred A. Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 2000

work page 2000
[19]

Memory-efficient backpropagation through time

Audrunas Gruslys, Rémi Munos, Ivo Danihelka, Marc Lanctot, and Alex Graves. Memory-efficient backpropagation through time. CoRR, abs/1606.03401, 2016. URL http://arxiv.org/abs/1606.03401

work page arXiv 2016
[20]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition, 2015

work page 2015
[21]

Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups

Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 2012

work page 2012
[22]

Long short-term memory

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997

work page 1997
[23]

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015

work page internal anchor Pith review arXiv 2015
[24]

Adaptive mixtures of local experts

Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts. Neural Computation, 1991

work page 1991
[25]

Google's multilingual neural machine translation system: Enabling zero-shot translation

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google's multilingual neural machine translation system: Enabling zero-shot translation. CoRR, abs/1611.04558, 2016. URL http://arxiv.org/abs/1611.04558

work page arXiv 2016
[26]

Hierarchical mixtures of experts and the EM algorithm

Michael I. Jordan and Robert A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 1994

work page 1994
[27]

Exploring the limits of language modeling

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016

work page arXiv 2016
[28]

Adam: A method for stochastic optimization

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015

work page 2015
[29]

Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling, 1995

work page 1995
[30]

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012

work page 2012
[31]

Building high-level features using large scale unsupervised learning

Quoc V. Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeffrey Dean, and Andrew Y. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012

work page 2012
[32]

Deep sequential neural network

Ludovic Denoyer and Patrick Gallinari. Deep sequential neural network. arXiv preprint arXiv:1410.0510, 2014

work page arXiv 2014
[33]

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. EMNLP, 2015a

work page 2015
[34]

Addressing the rare word problem in neural machine translation

Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. Addressing the rare word problem in neural machine translation. ACL, 2015b

work page 2015
[35]

Infinite mixtures of Gaussian process experts

Carl Edward Rasmussen and Zoubin Ghahramani. Infinite mixtures of Gaussian process experts. NIPS, 2002

work page 2002
[36]

Long short-term memory recurrent neural network architectures for large scale acoustic modeling

Hasim Sak, Andrew W Senior, and Françoise Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In INTERSPEECH, pp. 338–342, 2014

work page 2014
[37]

Japanese and Korean voice search

Mike Schuster and Kaisuke Nakajima. Japanese and Korean voice search. ICASSP, 2012

work page 2012
[38]

Nonlinear models using dirichlet process mixtures

Babak Shahbaba and Radford Neal. Nonlinear models using dirichlet process mixtures. JMLR, 2009

work page 2009
[39]

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014

work page 2014
[40]

Generative image modeling using spatial LSTMs

Lucas Theis and Matthias Bethge. Generative image modeling using spatial LSTMs. In NIPS, 2015

work page 2015
[41]

Mixtures of Gaussian Processes

Volker Tresp. Mixtures of Gaussian Processes. In NIPS, 2001

work page 2001
[42]

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason R...

work page internal anchor Pith review arXiv 2016
[43]

Hierarchical mixture of classification experts uncovers interactions between brain regions

Bangpeng Yao, Dirk Walther, Diane Beck, and Li Fei-Fei. Hierarchical mixture of classification experts uncovers interactions between brain regions. In NIPS, 2009

work page 2009
[44]

Recurrent neural network regularization

Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014

work page arXiv 2014
[45]

Deep recurrent models with fast-forward connections for neural machine translation

Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. Deep recurrent models with fast-forward connections for neural machine translation. arXiv preprint arXiv:1606.04199, 2016

work page arXiv 2016