arxiv: 2202.08906 · v2 · submitted 2022-02-17 · 💻 cs.CL · cs.LG

Recognition: 2 theorem links

· Lean Theorem

ST-MoE: Designing Stable and Transferable Sparse Expert Models

Barret Zoph, Irwan Bello, Jeff Dean, Nan Du, Noam Shazeer, Sameer Kumar, William Fedus, Yanping Huang

Pith reviewed 2026-05-12 23:09 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords mixture of expertssparse modelstransfer learningtraining stabilitylanguage modelsmodel scalingnatural language processing

0 comments

The pith

A sparse mixture-of-experts model achieves state-of-the-art transfer learning performance for the first time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mixture-of-experts models scale language capabilities efficiently by activating only a subset of parameters per token, yet they have been held back by training instabilities that undermine fine-tuning quality. This paper isolates design choices for router stability, capacity factors, and training procedures that eliminate those instabilities. The authors apply the fixes to produce the ST-MoE-32B model, which contains 269 billion total parameters but runs at the computational cost of a 32 billion parameter dense encoder-decoder. When transferred to downstream tasks, the model sets new records on reasoning, summarization, closed-book question answering, and adversarial benchmarks. A reader cares because the result indicates that sparsity can deliver leading performance without requiring the full training expense of dense models.

Core claim

The paper shows that targeted modifications to router stability and capacity factors, together with adjusted training procedures, allow sparse expert models to train reliably and to achieve state-of-the-art transfer results across a broad suite of natural language tasks, including SuperGLUE, ARC, XSum, CNN-DM, WebQA, Natural Questions, Winogrande, and ANLI R3.

What carries the argument

Router stability techniques combined with tuned capacity factors that maintain balanced expert utilization and prevent training collapse during both pre-training and fine-tuning.

If this is right

Sparse models can be scaled to hundreds of billions of parameters while remaining trainable and transferable.
Fine-tuning quality becomes consistent enough for production use across reasoning and summarization tasks.
Inference cost drops relative to dense models of comparable capability because only a fraction of experts activate per token.
Energy-efficient scaling paths open for language models without sacrificing benchmark leadership.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same stability fixes may allow even larger sparse models to be trained successfully beyond 269 billion parameters.
The design principles could be tested on other sparse routing architectures to check whether they generalize.
Adopting these procedures might reduce the practical barrier to deploying high-capacity models in resource-constrained settings.

Load-bearing premise

The observed stability and transfer gains come primarily from the described router and capacity choices rather than from unmentioned factors such as data selection or optimizer details.

What would settle it

A replication that applies the same router stability and capacity rules yet still encounters training collapse or fails to match the reported scores on SuperGLUE, XSum, or Natural Questions.

read the original abstract

Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ST-MoE, a set of design choices for Mixture-of-Experts models aimed at improving training stability and transfer performance. Key elements include router z-loss, capacity factor scheduling, and auxiliary losses. The authors scale a sparse model to 269B parameters (ST-MoE-32B) whose training compute matches a 32B dense encoder-decoder Transformer and report that it achieves state-of-the-art results on a broad suite of transfer tasks: SuperGLUE, ARC Easy/Challenge, XSum, CNN-DM, WebQA, Natural Questions, Winogrande, and ANLI R3. The work is framed as a practical design guide for stable sparse models.

Significance. If the headline transfer results prove robust and the gains can be isolated to the proposed MoE-specific techniques, the paper would be significant: it would be the first demonstration that a sparse model can reach SOTA across diverse transfer benchmarks while retaining the inference efficiency of sparsity. The scaling result and the explicit design-guide framing also provide concrete, reusable guidance for practitioners.

major comments (2)

[§4 and §5] §4 (Experiments) and §5 (Results): The central claim that the router z-loss, capacity-factor schedule, and auxiliary losses are the decisive factors enabling stable pretraining and SOTA transfer is not supported by a controlled comparison. No table or subsection holds the pretraining corpus, data mixture, and optimizer schedule fixed while toggling only the MoE components against an otherwise identical dense baseline. Without this isolation the attribution of stability and transfer gains remains confounded.
[Table 1 and Table 2] Table 1 and Table 2: Reported scores for ST-MoE-32B on SuperGLUE, ARC, and summarization tasks lack error bars, standard deviations, or results from multiple random seeds. Given the known sensitivity of large-model fine-tuning, single-run numbers are insufficient to substantiate the “state-of-the-art” claim or to allow readers to assess whether the reported margins are reliable.

minor comments (2)

[§3.2] §3.2: The definition of the router z-loss is clear, but the text does not state the exact coefficient used in the final runs; adding this hyper-parameter value would improve reproducibility.
[Figure 4] Figure 4: The capacity-factor scheduling plot would benefit from an explicit legend indicating which curve corresponds to the final ST-MoE-32B configuration.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments help clarify the scope of our claims and the evidence needed to support them. We address each major comment below and indicate the revisions we will make.

read point-by-point responses

Referee: [§4 and §5] §4 (Experiments) and §5 (Results): The central claim that the router z-loss, capacity-factor schedule, and auxiliary losses are the decisive factors enabling stable pretraining and SOTA transfer is not supported by a controlled comparison. No table or subsection holds the pretraining corpus, data mixture, and optimizer schedule fixed while toggling only the MoE components against an otherwise identical dense baseline. Without this isolation the attribution of stability and transfer gains remains confounded.

Authors: We agree that a fully isolated ablation—holding the exact pretraining corpus, data mixture, and optimizer schedule fixed while comparing only the addition of our MoE-specific techniques against an otherwise identical dense model—would provide the cleanest attribution. Our Section 4 ablations do isolate the effect of each individual technique (router z-loss, capacity-factor scheduling, auxiliary losses) on stability and downstream metrics while keeping the rest of the MoE architecture fixed, but these are performed within the sparse setting rather than against a matched dense baseline. The primary comparisons in Section 5 are to published dense models of comparable training compute. In the revised manuscript we will add an explicit limitations paragraph in Section 5 acknowledging that the reported gains are those of the full ST-MoE recipe versus published dense baselines, and we will clarify that the individual technique ablations demonstrate necessity within the sparse regime but do not constitute a controlled dense-versus-sparse experiment. revision: partial
Referee: [Table 1 and Table 2] Table 1 and Table 2: Reported scores for ST-MoE-32B on SuperGLUE, ARC, and summarization tasks lack error bars, standard deviations, or results from multiple random seeds. Given the known sensitivity of large-model fine-tuning, single-run numbers are insufficient to substantiate the “state-of-the-art” claim or to allow readers to assess whether the reported margins are reliable.

Authors: We acknowledge that single-run fine-tuning results at this scale limit the ability to quantify statistical reliability. Training and evaluating the 269 B-parameter model multiple times is computationally prohibitive. In the revised version we will (1) add a short discussion in Section 5 noting this limitation and referencing the variance observed across random seeds in our smaller-scale ablations (reported in the appendix), and (2) qualify the “state-of-the-art” language to “competitive with or exceeding prior published single-run results” where appropriate. We will not be able to add error bars from multiple full-scale runs. revision: partial

Circularity Check

0 steps flagged

Empirical design guide with no derivation chain or self-referential predictions

full rationale

The paper is an empirical contribution focused on training instabilities and fine-tuning quality in large MoE models. It reports scaling results to 269B parameters and SOTA transfer performance on external benchmarks (SuperGLUE, ARC, XSum, etc.). No mathematical derivations, equations, fitted parameters presented as predictions, or load-bearing self-citations appear in the provided text. Central claims are framed as experimental outcomes of design choices rather than quantities that reduce to inputs by construction. The work is self-contained against external benchmarks and does not invoke uniqueness theorems, ansatzes smuggled via citation, or renamings of known results. This matches the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the work is presented as an empirical design study rather than a theoretical derivation.

pith-pipeline@v0.9.0 · 5500 in / 1046 out tokens · 35882 ms · 2026-05-12T23:09:01.254393+00:00 · methodology

discussion (0)

Forward citations

Cited by 27 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts
cs.LG 2026-05 unverdicted novelty 7.0

Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.
Mixture of Layers with Hybrid Attention
cs.LG 2026-05 unverdicted novelty 7.0

Mixture of Layers replaces monolithic transformer blocks with routed thin parallel blocks using hybrid attention that combines a shared softmax block for global context with Gated DeltaNet linear attention in the rout...
SDG-MoE: Signed Debate Graph Mixture-of-Experts
cs.LG 2026-05 unverdicted novelty 7.0

SDG-MoE introduces learned signed interaction graphs and disagreement-gated deliberation among experts in MoE architectures, yielding 19.8% better validation perplexity than the strongest baseline.
SDG-MoE: Signed Debate Graph Mixture-of-Experts
cs.LG 2026-05 unverdicted novelty 7.0

SDG-MoE adds learned support and critique graphs plus disagreement-gated message passing to MoE models, yielding 19.8% better validation perplexity than the strongest baseline in three-seed pretraining.
When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models
cs.LG 2026-05 unverdicted novelty 7.0

Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts
cs.LG 2026-05 conditional novelty 7.0

Adding temporal memory via LIF, precision-weighted gating, and anticipatory prediction to MoE routers recovers effective expert selection at distribution transitions, with ablation confirming a super-additive beta-ant...
Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality
cs.AI 2026-04 conditional novelty 7.0

Routing topology in sparse Mixture-of-Experts models does not determine asymptotic language modeling perplexity; multiple variants including cosine-similarity routing achieve statistically equivalent performance.
Jamba: A Hybrid Transformer-Mamba Language Model
cs.CL 2024-03 conditional novelty 7.0

Jamba presents a hybrid Transformer-Mamba MoE architecture for LLMs that delivers state-of-the-art benchmark performance and strong results up to 256K token contexts while fitting in one 80GB GPU with high throughput.
Sparse Layers are Critical to Scaling Looped Language Models
cs.LG 2026-05 unverdicted novelty 6.0

Looped MoE models scale better than standard transformers because different experts activate on each loop pass, recovering expressivity without extra parameters, and support superior early exits.
Hierarchical Mixture-of-Experts with Two-Stage Optimization
cs.LG 2026-05 unverdicted novelty 6.0

Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and v...
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
cs.LG 2026-05 unverdicted novelty 6.0

A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
Cumulative-Goodness Free-Riding in Forward-Forward Networks: Real, Repairable, but Not Accuracy-Dominant
cs.LG 2026-05 unverdicted novelty 6.0

Cumulative-goodness Forward-Forward networks exhibit layer free-riding where discrimination gradients decay exponentially with prior positive margins; per-block, hardness-gated, and depth-scaled remedies yield 4-45x b...
ZeRO-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
cs.LG 2026-05 unverdicted novelty 6.0

ZeRO-Prefill achieves 1.35-1.59x higher throughput for MoE prefill serving by replacing per-layer activation AllToAll with overlapped asynchronous weight AllGather and prefix-aware routing.
Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns
cs.LG 2026-04 unverdicted novelty 6.0

Profiling shows persistent expert load imbalance and domain-specific activation patterns in large MoE models; workload-aware grouping and placement reduce all-to-all communication volume by up to 20x.
Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
cs.CV 2026-04 unverdicted novelty 6.0

Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
cs.CL 2025-05 conditional novelty 6.0

Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.
Emergent Abilities of Large Language Models
cs.CL 2022-06 unverdicted novelty 6.0

Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
PaLM: Scaling Language Modeling with Pathways
cs.CL 2022-04 accept novelty 6.0

PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
Teacher-Guided Routing for Sparse Vision Mixture-of-Experts
cs.CV 2026-04 unverdicted novelty 5.0

Teacher-guided routing supplies pseudo-supervision from a dense model's intermediate features to stabilize expert selection in sparse vision MoE models.
Revisiting Auxiliary Losses for Conditional Depth Routing: An Empirical Study
cs.LG 2026-04 conditional novelty 5.0

Removing utility regression and rank supervision auxiliary losses improves language modeling performance and training efficiency for conditional depth routing gates, and eliminates the advantage of a more complex JEPA...
Does a Global Perspective Help Prune Sparse MoEs Elegantly?
cs.CL 2026-04 unverdicted novelty 5.0

GRAPE is a global redundancy-aware pruning strategy for sparse MoEs that dynamically allocates pruning budgets across layers and improves average accuracy by 1.40% over the best local baseline across tested models and...
OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment
cs.IR 2025-02 unverdicted novelty 5.0

OneRec unifies retrieval and ranking in a generative recommender using session-wise decoding and iterative DPO-based preference alignment, achieving real-world gains on Kuaishou.
PaLM 2 Technical Report
cs.CL 2023-05 unverdicted novelty 5.0

PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
EMO: Frustratingly Easy Progressive Training of Extendable MoE
cs.LG 2026-05 unverdicted novelty 4.0

EMO progressively expands the expert pool in MoE models using scaling-law-derived token budgets per stage, matching fixed-expert performance while cutting wall-clock time and GPU cost.
Token Economics for LLM Agents: A Dual-View Study from Computing and Economics
cs.AI 2026-05 unverdicted novelty 4.0

The paper delivers a unified survey of token economics for LLM agents, conceptualizing tokens as production factors, exchange mediums, and units of account across micro, meso, macro, and security dimensions using esta...
Joint-Centric Dual Contrastive Alignment with Structure-Preserving and Information-Balanced Regularization
cs.LG 2026-04 unverdicted novelty 4.0

HILBERT uses joint-centric dual contrastive learning with CKA and mutual information regularizers to align long-sequence audio-text embeddings while preserving structure and balancing modalities.
Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models
cs.CL 2026-04 accept novelty 4.0

Gemma-4-E4B with few-shot chain-of-thought reaches the highest weighted accuracy of 0.675 at 14.9 GB VRAM, while the larger Gemma-4-26B-A4B MoE model scores 0.663 but uses 48.1 GB.

Reference graph

Works this paper leans on

175 extracted references · 175 canonical work pages · cited by 26 Pith papers · 29 internal anchors

[1]

Neural computation , volume=

Adaptive mixtures of local experts , author=. Neural computation , volume=. 1991 , publisher=

work page 1991
[2]

Neural computation , volume=

Hierarchical mixtures of experts and the EM algorithm , author=. Neural computation , volume=. 1994 , publisher=

work page 1994
[6]

12th \ USENIX \ symposium on operating systems design and implementation ( \ OSDI \ 16) , pages=

Tensorflow: A system for large-scale machine learning , author=. 12th \ USENIX \ symposium on operating systems design and implementation ( \ OSDI \ 16) , pages=

work page
[7]

Advances in Neural Information Processing Systems , pages=

Mesh-tensorflow: Deep learning for supercomputers , author=. Advances in Neural Information Processing Systems , pages=

work page
[8]

Improving language understanding by generative pre-training , author=

work page
[14]

Proceedings of the 44th annual international symposium on computer architecture , pages=

In-datacenter performance analysis of a tensor processing unit , author=. Proceedings of the 44th annual international symposium on computer architecture , pages=

work page
[18]

International conference on machine learning , pages=

On the difficulty of training recurrent neural networks , author=. International conference on machine learning , pages=. 2013 , organization=

work page 2013
[19]

International conference on machine learning , pages=

Batch normalization: Accelerating deep network training by reducing internal covariate shift , author=. International conference on machine learning , pages=. 2015 , organization=

work page 2015
[21]

Advances in neural information processing systems , volume=

Weight normalization: A simple reparameterization to accelerate training of deep neural networks , author=. Advances in neural information processing systems , volume=

work page
[23]

Google Cloud Blog , year=

BFloat16: The secret to high performance on Cloud TPUs , author=. Google Cloud Blog , year=

work page
[24]

Google AI Blog , year=

Introducing Pathways: A next-generation AI architecture , author=. Google AI Blog , year=

work page
[27]

Distilling the Knowledge in a Neural Network

Distilling the knowledge in a neural network , author=. arXiv preprint arXiv:1503.02531 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[28]

Reinforcement learning: An introduction , author=

work page
[29]

Advances in Neural Information Processing Systems , pages=

Large memory layers with product keys , author=. Advances in Neural Information Processing Systems , pages=

work page
[30]

Advances in neural information processing systems , pages=

Attention is all you need , author=. Advances in neural information processing systems , pages=

work page
[31]

and Krizhevsky, Alex and Sutskever, Ilya and Salakhutdinov, Ruslan , biburl =

Srivastava, Nitish and Hinton, Geoffrey E. and Krizhevsky, Alex and Sutskever, Ilya and Salakhutdinov, Ruslan , biburl =. Dropout: a simple way to prevent neural networks from overfitting. , url =. Journal of Machine Learning Research , keywords =

work page
[32]

2020 , eprint=

Extracting Training Data from Large Language Models , author=. 2020 , eprint=

work page 2020
[35]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Winogrande: An adversarial winograd schema challenge at scale , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[37]

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi

Big bird: Transformers for longer sequences , author=. arXiv preprint arXiv:2007.14062 , year=

work page arXiv 2007
[38]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Megatron-lm: Training multi-billion parameter language models using gpu model parallelism , author=. arXiv preprint arXiv:1909.08053 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1909
[39]

Teaching Machines to Read and Comprehend , url =

Hermann, Karl Moritz and Kocisky, Tomas and Grefenstette, Edward and Espeholt, Lasse and Kay, Will and Suleyman, Mustafa and Blunsom, Phil , booktitle =. Teaching Machines to Read and Comprehend , url =

work page
[40]

2020 , eprint=

XLNet: Generalized Autoregressive Pretraining for Language Understanding , author=. 2020 , eprint=

work page 2020
[41]

Communications of the ACM , volume=

Roofline: an insightful visual performance model for multicore architectures , author=. Communications of the ACM , volume=. 2009 , publisher=

work page 2009
[42]

2020 , eprint=

GLU Variants Improve Transformer , author=. 2020 , eprint=

work page 2020
[43]

2019 , eprint=

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter , author=. 2019 , eprint=

work page 2019
[44]

An Overview of Multi-Task Learning in Deep Neural Networks

An overview of multi-task learning in deep neural networks , author=. arXiv preprint arXiv:1706.05098 , year=

work page internal anchor Pith review arXiv
[45]

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

Glue: A multi-task benchmark and analysis platform for natural language understanding , author=. arXiv preprint arXiv:1804.07461 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[46]

Advances in Neural Information Processing Systems , pages=

Superglue: A stickier benchmark for general-purpose language understanding systems , author=. Advances in Neural Information Processing Systems , pages=

work page
[47]

Advances in neural information processing systems , pages=

Teaching machines to read and comprehend , author=. Advances in neural information processing systems , pages=

work page
[49]

Cloze procedure

“Cloze procedure”: A new tool for measuring readability , author=. Journalism quarterly , volume=. 1953 , publisher=

work page 1953
[52]

The hardware lottery

The Hardware Lottery , author=. arXiv preprint arXiv:2009.06489 , year=

work page arXiv 2009
[53]

International Conference on Learning Representations , year=

Diversity and depth in per-example routing models , author=. International Conference on Learning Representations , year=

work page
[54]

Transactions of the Association for Computational Linguistics , volume=

Natural questions: a benchmark for question answering research , author=. Transactions of the Association for Computational Linguistics , volume=. 2019 , publisher=

work page 2019
[55]

Proceedings of the 2013 conference on empirical methods in natural language processing , pages=

Semantic parsing on freebase from question-answer pairs , author=. Proceedings of the 2013 conference on empirical methods in natural language processing , pages=

work page 2013
[57]

Neural computation , volume=

Long short-term memory , author=. Neural computation , volume=. 1997 , publisher=

work page 1997
[58]

Generating Long Sequences with Sparse Transformers

Generating long sequences with sparse transformers , author=. arXiv preprint arXiv:1904.10509 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1904
[59]

Longformer: The Long-Document Transformer

Longformer: The long-document transformer , author=. arXiv preprint arXiv:2004.05150 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2004
[60]

Routing networks: Adaptive selection of non-linear functions for multi-task learning

Routing networks: Adaptive selection of non-linear functions for multi-task learning , author=. arXiv preprint arXiv:1711.01239 , year=

work page arXiv
[61]

Rich Sutton , title =

work page
[62]

PipeDream: Fast and Efficient Pipeline Parallel DNN Training

Pipedream: Fast and efficient pipeline parallel dnn training , author=. arXiv preprint arXiv:1806.03377 , year=

work page Pith review arXiv
[64]

Scalable transfer learning with expert models

Scalable transfer learning with expert models , author=. arXiv preprint arXiv:2009.13239 , year=

work page arXiv 2009
[65]

Exponentially increasing the capacity-to-computation ratio for conditional computation in deep learning

Exponentially increasing the capacity-to-computation ratio for conditional computation in deep learning , author=. arXiv preprint arXiv:1406.7362 , year=

work page arXiv
[66]

Reformer: The Efficient Transformer

Reformer: The efficient transformer , author=. arXiv preprint arXiv:2001.04451 , year=

work page internal anchor Pith review arXiv 2001
[67]

Proceedings of the 30th on Symposium on Parallelism in Algorithms and Architectures , pages=

Integrated model, batch, and domain parallelism in training neural networks , author=. Proceedings of the 30th on Symposium on Parallelism in Algorithms and Architectures , pages=

work page
[68]

Gpu kernels for block-sparse weights , author=

work page
[69]

Sparse gpu kernels for deep learning

Sparse GPU Kernels for Deep Learning , author=. arXiv preprint arXiv:2006.10901 , year=

work page arXiv 2006
[70]

arXiv preprint arXiv:1802.08435 , year=

Efficient neural audio synthesis , author=. arXiv preprint arXiv:1802.08435 , year=

work page arXiv
[71]

Strubell, A

Energy and policy considerations for deep learning in NLP , author=. arXiv preprint arXiv:1906.02243 , year=

work page arXiv 1906
[74]

Advances in neural information processing systems , pages=

Gpipe: Efficient training of giant neural networks using pipeline parallelism , author=. Advances in neural information processing systems , pages=

work page
[75]

doi:10.48550/arXiv.2002.08909 , abstract =

Realm: Retrieval-augmented language model pre-training , author=. arXiv preprint arXiv:2002.08909 , year=

work page arXiv 2002
[77]

Bulletin of the American Mathematical Society , volume=

Some aspects of the sequential design of experiments , author=. Bulletin of the American Mathematical Society , volume=. 1952 , publisher=

work page 1952
[78]

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He

Zero: Memory optimization towards training a trillion parameter models , author=. arXiv preprint arXiv:1910.02054 , year=

work page arXiv 1910
[85]

International conference on machine learning , pages=

Language modeling with gated convolutional networks , author=. International conference on machine learning , pages=. 2017 , organization=

work page 2017
[86]

Icml , year=

Rectified linear units improve restricted boltzmann machines , author=. Icml , year=

work page
[88]

International Conference on Machine Learning , pages=

Adafactor: Adaptive learning rates with sublinear memory cost , author=. International Conference on Machine Learning , pages=. 2018 , organization=

work page 2018
[90]

2021 , eprint=

8-bit Optimizers via Block-wise Quantization , author=. 2021 , eprint=

work page 2021
[92]

2012 , publisher=

Antifragile: Things that gain from disorder , author=. 2012 , publisher=

work page 2012
[95]

Adaptively sparse transformers

Adaptively sparse transformers , author=. arXiv preprint arXiv:1909.00015 , year=

work page arXiv 1909
[96]

Adaptive attention span in transformers

Adaptive attention span in transformers , author=. arXiv preprint arXiv:1905.07799 , year=

work page arXiv 1905
[97]

Journal of Machine Learning Research , volume=

Beyond english-centric multilingual machine translation , author=. Journal of Machine Learning Research , volume=

work page
[98]

proceedings of Sinn und Bedeutung , volume=

The CommitmentBank: Investigating projection in naturally occurring discourse , author=. proceedings of Sinn und Bedeutung , volume=

work page
[100]

2015 , eprint=

Deep Residual Learning for Image Recognition , author=. 2015 , eprint=

work page 2015
[101]

Advances in neural information processing systems , volume=

A simple weight decay can improve generalization , author=. Advances in neural information processing systems , volume=

work page
[102]

2019 , eprint=

Decoupled Weight Decay Regularization , author=. 2019 , eprint=

work page 2019
[104]

arXiv preprint arXiv:2107.02137 , year=

ERNIE 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation , author=. arXiv preprint arXiv:2107.02137 , year=

work page arXiv
[105]

2021 , eprint=

Scalable and Efficient MoE Training for Multitask Multilingual Models , author=. 2021 , eprint=

work page 2021
[106]

2021 , eprint=

Taming Sparsely Activated Transformer with Stochastic Experts , author=. 2021 , eprint=

work page 2021
[107]

2021 , eprint=

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts , author=. 2021 , eprint=

work page 2021
[108]

2021 , eprint=

Efficient Large Scale Language Modeling with Mixtures of Experts , author=. 2021 , eprint=

work page 2021
[109]

On the Opportunities and Risks of Foundation Models

On the opportunities and risks of foundation models , author=. arXiv preprint arXiv:2108.07258 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[110]

2021 , eprint=

Sparse-MLP: A Fully-MLP Architecture with Conditional Computation , author=. 2021 , eprint=

work page 2021
[111]

2021 , eprint=

M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining , author=. 2021 , eprint=

work page 2021
[112]

2021 , eprint=

R-Drop: Regularized Dropout for Neural Networks , author=. 2021 , eprint=

work page 2021
[113]

2020 , eprint=

Dense Passage Retrieval for Open-Domain Question Answering , author=. 2020 , eprint=

work page 2020
[114]

2020 , eprint=

PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization , author=. 2020 , eprint=

work page 2020
[115]

2021 , eprint=

M6-T: Exploring Sparse Expert Models and Beyond , author=. 2021 , eprint=

work page 2021
[116]

2021 , eprint=

SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts , author=. 2021 , eprint=

work page 2021
[117]

2021 , eprint=

SpeechMoE2: Mixture-of-Experts Model with Improved Routing , author=. 2021 , eprint=

work page 2021
[118]

2021 , eprint=

Building a great multi-lingual teacher with sparsely-gated mixture of experts for speech recognition , author=. 2021 , eprint=

work page 2021
[119]

Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining , pages=

Modeling task relationships in multi-task learning with multi-gate mixture-of-experts , author=. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining , pages=

work page

Showing first 80 references.