arxiv: 2101.03961 · v3 · submitted 2021-01-11 · 💻 cs.LG · cs.AI

Recognition: 1 theorem link

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

William Fedus , Barret Zoph , Noam Shazeer


Pith reviewed 2026-05-12 23:53 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords Switch Transformer · mixture of experts · sparse models · language model scaling · trillion parameters · routing algorithm · pre-training efficiency


The pith

Switch Transformers scale language models to a trillion parameters with constant compute and 4x pre-training speedup over T5-XXL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that Mixture of Experts models become practical at extreme scale by routing every token to exactly one expert instead of several. This change, paired with simple load-balancing losses and capacity limits, removes most of the prior routing complexity and communication overhead. As a result, models with up to a trillion parameters train stably in bfloat16, and Switch models derived from T5-Base and T5-Large pre-train up to 7x faster than their dense counterparts with the same computational resources. In the multilingual experiments, the Switch model improves over mT5-Base on all 101 languages.

Core claim

A simplified top-1 gating function in each Switch layer selects only the highest-scoring expert for every token, and auxiliary losses plus expert-capacity factors keep the experts balanced and prevent overflow. This combination allows the total parameter count to grow while the floating-point operations per token remain fixed, enabling stable pre-training of trillion-parameter models that reach the same quality as T5-XXL four times faster.
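To make the balancing machinery concrete, here is a minimal Python sketch of an expert-capacity limit and a load-balancing auxiliary loss. It is an illustration written for this page, not the authors' code. The formulas follow the commonly stated Switch formulation (capacity = tokens per batch / number of experts × capacity factor; loss = α · N · Σ f_i · P_i, where f_i is the fraction of tokens sent to expert i and P_i is the mean router probability for expert i). The function names, the rounding with `math.ceil`, and the default `alpha=0.01` are assumptions of this sketch.

```python
import math


def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    """Maximum number of tokens a single expert accepts per batch.

    Rounding up is an assumption of this sketch.
    """
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)


def load_balancing_loss(router_probs, assignments, num_experts, alpha=0.01):
    """Auxiliary loss alpha * N * sum_i f_i * P_i.

    router_probs: one probability vector (length num_experts) per token
    assignments:  index of the expert each token was routed to
    """
    num_tokens = len(assignments)
    total = 0.0
    for i in range(num_experts):
        # f_i: fraction of tokens dispatched to expert i
        f_i = sum(1 for a in assignments if a == i) / num_tokens
        # P_i: mean router probability assigned to expert i
        p_i = sum(p[i] for p in router_probs) / num_tokens
        total += f_i * p_i
    return alpha * num_experts * total
```

Under this definition, perfectly uniform routing gives a loss equal to `alpha`, while collapsing every token onto one expert with probability 1 raises it to `alpha * num_experts`. That gap is the pressure that discourages expert collapse.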

What carries the argument

The Switch layer, which replaces a standard feed-forward network with a mixture of experts gated by a simple top-1 router that assigns each token to a single expert.
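The routing step described above can be sketched in a few lines of plain Python. This is a toy, single-device illustration under stated assumptions, not the paper's implementation: `switch_layer`, `router_weights`, and the list-of-callables `experts` are hypothetical names, and tokens that overflow an expert's capacity are simply returned unchanged to stand in for a residual pass-through.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def switch_layer(tokens, router_weights, experts, capacity):
    """Send each token to its single highest-probability expert (top-1).

    tokens:         list of feature vectors (lists of floats)
    router_weights: one weight vector per expert, same length as a token
    experts:        list of callables, each mapping a vector to a vector
    capacity:       max tokens per expert; overflow tokens skip the expert
    Returns (outputs, load), where load[i] counts tokens handled by expert i.
    """
    load = [0] * len(experts)
    outputs = []
    for x in tokens:
        # Router: one dot product per expert, then softmax.
        logits = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
        probs = softmax(logits)
        best = max(range(len(experts)), key=lambda i: probs[i])
        if load[best] >= capacity:
            # Assumption of this sketch: an overflowing token is passed through.
            outputs.append(list(x))
            continue
        load[best] += 1
        # Scale the expert output by the gate value so the router gets gradient
        # signal in a real framework; here it is just a multiplication.
        y = experts[best](x)
        outputs.append([probs[best] * v for v in y])
    return outputs, load
```

Each token touches exactly one expert, so the per-token work does not grow with the number of experts; only the router's dot products do.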

If this is right

Pre-training runs up to 7 times faster than T5-Base and T5-Large at identical compute.
Trillion-parameter models achieve 4 times the pre-training speed of T5-XXL.
Multilingual pre-training improves over mT5-Base on every one of the 101 languages.
Models train successfully in bfloat16, lowering memory use without quality loss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The constant-compute property suggests that further increases in expert count could produce still-larger effective capacity on the same hardware.
The same routing simplification may reduce instability in other sparse architectures outside language modeling.
Constant FLOPs per token could make very large models feasible under fixed inference budgets.
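The constant-compute property behind these extensions can be checked with back-of-envelope arithmetic. The Python sketch below counts parameters and per-token multiply-accumulates for one feed-forward block replaced by a Switch layer. The layer sizes in the usage note are illustrative, biases are ignored, and the function names are hypothetical; none of this is taken from the paper's tables.

```python
def ffn_params(d_model, d_ff):
    """Weights in a two-matrix feed-forward block (biases ignored)."""
    return 2 * d_model * d_ff


def switch_ffn_stats(d_model, d_ff, num_experts):
    """Return (total_params, macs_per_token) for one Switch layer.

    A token passes through the router (d_model * num_experts weights)
    and exactly one expert, so per-token cost is one expert plus the router.
    """
    per_expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts
    total_params = num_experts * per_expert + router
    macs_per_token = per_expert + router
    return total_params, macs_per_token
```

For example, comparing `switch_ffn_stats(768, 3072, 1)` with `switch_ffn_stats(768, 3072, 128)` multiplies the parameter count by 128 while per-token work grows by only about 2%, all of it from the larger router. Per-token compute is therefore nearly, not exactly, constant in the number of experts.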

Load-bearing premise

The simplified top-1 routing and stabilization techniques continue to produce stable training and competitive quality when both the number of experts and total model size are increased far beyond the scales tested.

What would settle it

Training divergence, expert collapse, or failure to match dense-model quality when the number of experts exceeds a few thousand or total parameters exceed a few trillion would falsify the claim.

Original abstract

In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability -- we address these with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the "Colossal Clean Crawled Corpus" and achieve a 4x speedup over the T5-XXL model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Switch Transformers shows a workable top-1 MoE router plus stability fixes that let you train to a trillion parameters with clear speedups over dense T5 baselines.


The main point is that this paper makes Mixture of Experts simpler and more stable at extreme scale. They drop the routing to top-1 instead of top-k, add auxiliary losses and capacity factors, and get bfloat16 training to hold up. That combination produces up to 7x pre-training speed on T5-Base and Large sizes and a 4x speedup over T5-XXL at the 1.6T Switch-C model, all while keeping quality competitive on the Colossal Clean Crawled Corpus. They also show consistent gains across 101 languages in the multilingual setting. Those numbers are the concrete advance over earlier MoE work like GShard.

The empirical results across multiple model sizes and the fact they actually ran the trillion-parameter case are the strongest parts. The routing change reduces communication and the stability techniques let them avoid higher precision, which matters for real hardware runs.

One limitation is that the ablations on the individual stabilization pieces are not broken out at the largest scale. The paper reports overall success but does not isolate how much each fix contributes when experts and parameters go well beyond the 100B checkpoints. That leaves a modest gap in understanding whether the top-1 router would stay reliable without further hyper-parameter work at even bigger sizes. The rest of the evidence is straightforward empirical measurement against dense baselines, with no circularity.

This paper is for people who train large language models and want practical ways to increase parameter count without proportional compute. Anyone working on sparse scaling or efficiency will find usable details on the router and training setup. It is solid enough to deserve a serious referee; the scale results and the simplification are worth detailed review even if some ablations could be tighter.

Referee Report

1 major / 2 minor

Summary. The paper introduces the Switch Transformer, a simplified Mixture-of-Experts architecture using top-1 routing that enables sparsely activated models with up to a trillion parameters at constant compute cost. It reports up to 7x pre-training speedups over dense T5-Base/Large models, multilingual gains across 101 languages, and a 4x speedup for a 1.6T-parameter Switch-C model over T5-XXL when pre-trained on C4, while introducing stabilization techniques (auxiliary losses, capacity factors, bfloat16 training) to address instability.

Significance. If the empirical results hold, the work is significant for demonstrating that simple top-1 routing plus targeted stabilizations can scale MoE models to the trillion-parameter regime with practical speedups and competitive quality. The multi-scale experiments, multilingual results, and first reported bfloat16 training of such large sparse models provide concrete evidence that sparsity can be made more accessible, which has influenced subsequent large-model design.

major comments (1)

§4.3 and §4.4 (scaling experiments): The headline trillion-parameter results for Switch-C (1.6T) report aggregate quality and speedup but provide no isolated ablations that vary only the stabilization components (auxiliary loss, capacity factor, bfloat16) at that scale. All detailed ablations are shown on T5-Base/Large-derived models; without scale-specific controls it is impossible to confirm that the reported stability is due to the proposed techniques rather than unstated hyper-parameter retuning.

minor comments (2)

Abstract and §4: Speedup and quality numbers are presented without error bars or run-to-run variance, which weakens the strength of the 4x and 7x claims even though the trends are consistent across model sizes.
§3.2: The capacity-factor and auxiliary-loss formulations are described in prose; adding a compact equation or pseudocode block would improve reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive assessment of the work and for recommending minor revision. We address the single major comment below.


Referee: §4.3 and §4.4 (scaling experiments): The headline trillion-parameter results for Switch-C (1.6T) report aggregate quality and speedup but provide no isolated ablations that vary only the stabilization components (auxiliary loss, capacity factor, bfloat16) at that scale. All detailed ablations are shown on T5-Base/Large-derived models; without scale-specific controls it is impossible to confirm that the reported stability is due to the proposed techniques rather than unstated hyper-parameter retuning.

Authors: We agree that isolated ablations varying only the stabilization components at the full 1.6T scale would strengthen the evidence. However, training even one 1.6T model is extremely resource-intensive, and repeating the process for controlled ablations is not feasible. The auxiliary loss, capacity factor, and bfloat16 techniques were developed and validated through detailed experiments on smaller T5-Base/Large-derived models (as reported in §4.3), then applied to enable stable training of Switch-C. The fact that the 1.6T model trained successfully without divergence, using these exact techniques, provides supporting evidence of their utility at scale. We will revise §4.4 to explicitly acknowledge the absence of full-scale isolated ablations, clarify that the techniques generalize from smaller-scale validation, and note the computational constraints. This makes the claims more precise without overstating the evidence.

revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical scaling results with no derivations reducing to inputs


The paper reports measured pre-training speedups, quality metrics, and stability observations from training Switch Transformer variants on C4 and multilingual data. No equations, uniqueness theorems, or first-principles derivations are invoked whose outputs are forced by construction from fitted parameters or self-citations. All reported gains are direct comparisons against dense T5 baselines at matched compute; stabilization techniques are presented as engineering choices validated by ablation tables rather than as predictions derived from the model itself. The central scaling claim (trillion-parameter models with 4x speedup) rests on experimental checkpoints, not on any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard deep-learning assumptions about gradient-based optimization and the existence of useful expert specialization; no new physical or mathematical axioms are introduced. The number of experts and the capacity factor are design choices rather than fitted parameters for the reported results.

axioms (1)

Domain assumption: Gradient descent on a sparsely activated network will converge to a useful solution when combined with the described stabilization techniques. Invoked implicitly when claiming stable training of trillion-parameter models.



Forward citations

Cited by 32 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts
cs.LG 2026-05 unverdicted novelty 7.0

Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.
SDG-MoE: Signed Debate Graph Mixture-of-Experts
cs.LG 2026-05 unverdicted novelty 7.0

SDG-MoE adds learned support and critique graphs plus disagreement-gated message passing to MoE models, yielding 19.8% better validation perplexity than the strongest baseline in three-seed pretraining.
SDG-MoE: Signed Debate Graph Mixture-of-Experts
cs.LG 2026-05 unverdicted novelty 7.0

SDG-MoE introduces learned signed interaction graphs and disagreement-gated deliberation among experts in MoE architectures, yielding 19.8% better validation perplexity than the strongest baseline.
MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
cs.LG 2026-05 conditional novelty 7.0

MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
Boundary Mass and the Soft-to-Hard Limit in Mixture-of-Experts
cs.LG 2026-05 unverdicted novelty 7.0

Boundary mass in MoE is linear in slab width under smoothness and transversality, so the zero-temperature limit is governed by a thin geometric layer around routing interfaces rather than the full input space.
Model Compression with Exact Budget Constraints via Riemannian Manifolds
cs.LG 2026-05 unverdicted novelty 7.0

The budget constraint in discrete model compression defines a Riemannian manifold allowing exact-constraint first-order optimization via Riemannian Constrained Optimization (RCO) without extra hyperparameters.
Plasticity-Enhanced Multi-Agent Mixture of Experts for Dynamic Objective Adaptation in UAVs-Assisted Emergency Communication Networks
cs.MA 2026-04 unverdicted novelty 7.0

PE-MAMoE combines sparsely gated mixture-of-experts actors with a non-parametric phase controller in MAPPO to maintain plasticity under dynamic user mobility and traffic, yielding 26.3% higher normalized IQM return in...
InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models
cs.DC 2026-04 unverdicted novelty 7.0

InfiniLoRA decouples LoRA execution from base-model inference and reports 3.05x higher request throughput plus 54% more adapters meeting strict latency SLOs.
PLUME: Latent Reasoning Based Universal Multimodal Embedding
cs.CV 2026-04 unverdicted novelty 7.0

PLUME uses latent-state autoregressive rollouts and a progressive training curriculum to deliver efficient reasoning for universal multimodal embeddings without generating explicit rationales.
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
cs.LG 2025-02 unverdicted novelty 7.0

A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
cs.CL 2024-05 unverdicted novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
cs.LG 2022-08 conditional novelty 7.0

LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
OPT: Open Pre-trained Transformer Language Models
cs.CL 2022-05 unverdicted novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
Quantifying Memorization Across Neural Language Models
cs.LG 2022-02 unverdicted novelty 7.0

Memorization in language models increases log-linearly with model capacity, data duplication count, and prompt context length.
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
cs.CL 2020-06 unverdicted novelty 7.0

DeBERTa improves BERT-style models by separating content and relative position in attention and adding absolute positions to the decoder, yielding consistent gains on NLU and NLG tasks and the first single-model super...
Federation of Experts: Communication Efficient Distributed Inference for Large Language Models
cs.LG 2026-05 unverdicted novelty 6.0

FoE restructures MoE blocks into per-KV-head clusters with sum-based synchronization, removing all-to-all communication in single-node settings and limiting it to intra-node in multi-node settings for up to 5.2x faste...
Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism
cs.DC 2026-05 unverdicted novelty 6.0

Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.
InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition
cs.CL 2026-05 unverdicted novelty 6.0

InfoLaw models pretraining as information accumulation where quality sets information density and repetition causes scale-dependent diminishing returns, predicting loss with low error on unseen mixtures and larger sca...
HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models
cs.LG 2026-04 unverdicted novelty 6.0

HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.
Decoupled DiLoCo for Resilient Distributed Pre-training
cs.CL 2026-04 unverdicted novelty 6.0

Decoupled DiLoCo enables asynchronous distributed pre-training with zero global downtime under simulated failures while preserving competitive performance on text and vision tasks.
Temporally Extended Mixture-of-Experts Models
cs.LG 2026-04 unverdicted novelty 6.0

Temporally extended MoE layers using the option-critic framework with deliberation costs cut switching rates below 5% while retaining most capability on MATH, MMLU, and MMMLU.
Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts
cs.LG 2026-04 unverdicted novelty 6.0

BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
cs.LG 2024-01 unverdicted novelty 6.0

SPIN lets weak LLMs become strong by self-generating training data from previous model versions and training to prefer human-annotated responses over its own outputs, outperforming DPO even with extra GPT-4 data on be...
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
cs.LG 2023-09 accept novelty 6.0

DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.
Emergent Abilities of Large Language Models
cs.CL 2022-06 unverdicted novelty 6.0

Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
PaLM: Scaling Language Modeling with Pathways
cs.CL 2022-04 accept novelty 6.0

PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
ST-MoE: Designing Stable and Transferable Sparse Expert Models
cs.CL 2022-02 unverdicted novelty 6.0

ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...
Ethical and social risks of harm from Language Models
cs.CL 2021-12 accept novelty 6.0

The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job...
Complexity Horizons of Compressed Models in Analog Circuit Analysis
cs.AI 2026-05 unverdicted novelty 5.0

Prerequisite graphs map compressed LLM performance boundaries in analog circuit analysis to allow selecting the smallest viable model for a given task complexity.
Domain-Specialized Object Detection via Model-Level Mixtures of Experts
cs.CV 2026-04 unverdicted novelty 5.0

Model-level MoE of domain-specialized YOLO detectors with gating network outperforms standard ensembles on BDD100K while revealing expert specialization.
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
cs.CL 2024-01 unverdicted novelty 5.0

DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.
Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
cs.CV 2026-05 unverdicted novelty 4.0

Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.
