pith. machine review for the scientific record.

arxiv: 2101.03961 · v3 · submitted 2021-01-11 · 💻 cs.LG · cs.AI

Recognition: 1 theorem link

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

William Fedus , Barret Zoph , Noam Shazeer

Pith reviewed 2026-05-12 23:53 UTC · model grok-4.3

keywords Switch Transformer · mixture of experts · sparse models · language model scaling · trillion parameters · routing algorithm · pre-training efficiency

The pith

Switch Transformers scale language models to a trillion parameters at constant compute per token, with a 4x pre-training speedup over T5-XXL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that Mixture of Experts models become practical at extreme scale by routing every token to exactly one expert instead of several. This change, paired with simple load-balancing losses and capacity limits, removes most of the prior routing complexity and communication overhead. As a result, large sparse models train stably in bfloat16: variants built on T5-Base and T5-Large pre-train up to 7x faster than their dense counterparts with the same computational resources, and models with up to a trillion parameters reach a 4x speedup over T5-XXL. In multilingual experiments, gains over mT5-Base appear across all 101 languages.

Core claim

A simplified top-1 gating function in each Switch layer selects only the highest-scoring expert for every token, and auxiliary losses plus expert-capacity factors keep the experts balanced and prevent overflow. This combination allows the total parameter count to grow while the floating-point operations per token remain fixed, enabling stable pre-training of trillion-parameter models that reach the same quality as T5-XXL four times faster.
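The routing step described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `switch_route`, the default `capacity_factor` and `alpha` values, and the first-come-first-served overflow rule are assumptions chosen for the example. The load-balancing term follows the form alpha * N * sum_i(f_i * P_i), where f_i is the fraction of tokens sent to expert i and P_i is the mean router probability for expert i.

```python
import numpy as np

def switch_route(router_logits, capacity_factor=1.25, alpha=0.01):
    """Top-1 routing sketch for one batch of tokens (hypothetical helper).

    router_logits: float array of shape (num_tokens, num_experts).
    Returns (expert_index, gate, kept, aux_loss).
    """
    num_tokens, num_experts = router_logits.shape

    # Softmax over experts, shifted for numerical stability.
    shifted = router_logits - router_logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Top-1: each token goes to its single highest-probability expert.
    expert_index = probs.argmax(axis=-1)
    gate = probs[np.arange(num_tokens), expert_index]

    # Each expert accepts at most `capacity` tokens; later arrivals overflow.
    # (Assumption: tokens are admitted in batch order.)
    capacity = int(capacity_factor * num_tokens / num_experts)
    one_hot = np.eye(num_experts, dtype=np.int64)[expert_index]
    position = (np.cumsum(one_hot, axis=0) * one_hot).sum(axis=-1)  # 1-based slot
    kept = position <= capacity

    # Load-balancing loss: alpha * N * sum_i f_i * P_i.
    f = one_hot.mean(axis=0)  # fraction of tokens routed to each expert
    p = probs.mean(axis=0)    # mean router probability per expert
    aux_loss = alpha * num_experts * float(np.sum(f * p))
    return expert_index, gate, kept, aux_loss
```

Under perfectly uniform routing the loss evaluates to `alpha`; it grows when both the token counts and the router probabilities concentrate on a few experts. Tokens with `kept == False` have overflowed their expert's capacity and would skip the expert computation in this sketch.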

What carries the argument

The Switch layer, which replaces a standard feed-forward network with a mixture of experts gated by a simple top-1 router that assigns each token to a single expert.
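A rough count makes the parameter-versus-compute trade concrete. The sketch below is back-of-the-envelope arithmetic under stated assumptions: the helper names are hypothetical, biases are ignored, one multiply-add is counted as 2 FLOPs, and the dimensions in the loop are illustrative values, not figures taken from the paper.

```python
def ffn_params(d_model, d_ff):
    # Two weight matrices (d_model x d_ff and d_ff x d_model); biases ignored.
    return 2 * d_model * d_ff

def switch_layer_cost(d_model, d_ff, num_experts):
    """Return (total parameters, FLOPs per token) for one Switch layer."""
    expert_params = num_experts * ffn_params(d_model, d_ff)
    router_params = d_model * num_experts
    # Per token: one router matmul plus exactly one expert FFN.
    flops_per_token = 2 * router_params + 2 * ffn_params(d_model, d_ff)
    return expert_params + router_params, flops_per_token

# Illustrative dimensions only.
for n in (1, 8, 64, 512):
    params, flops = switch_layer_cost(d_model=768, d_ff=3072, num_experts=n)
    print(n, params, flops)
```

In this count the layer's parameters grow linearly with the number of experts, while the per-token cost changes only through the small router term, which is the sense in which compute stays roughly constant.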

If this is right

  • Pre-training runs up to 7 times faster than T5-Base and T5-Large at identical compute.
  • Trillion-parameter models achieve 4 times the pre-training speed of T5-XXL.
  • Multilingual pre-training improves over mT5-Base on every one of the 101 languages.
  • Models train successfully in bfloat16, lowering memory use without quality loss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The constant-compute property suggests that further increases in expert count could produce still-larger effective capacity on the same hardware.
  • The same routing simplification may reduce instability in other sparse architectures outside language modeling.
  • Constant FLOPs per token could make very large models feasible under fixed inference budgets.

Load-bearing premise

The simplified top-1 routing and stabilization techniques continue to produce stable training and competitive quality when both the number of experts and total model size are increased far beyond the scales tested.

What would settle it

Training divergence, expert collapse, or failure to match dense-model quality when the number of experts exceeds a few thousand or total parameters exceed a few trillion would falsify the claim.

Original abstract

In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability -- we address these with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the "Colossal Clean Crawled Corpus" and achieve a 4x speedup over the T5-XXL model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces the Switch Transformer, a simplified Mixture-of-Experts architecture using top-1 routing that enables sparsely activated models with up to a trillion parameters at constant compute cost. It reports up to 7x pre-training speedups over dense T5-Base/Large models, multilingual gains across 101 languages, and a 4x speedup for a 1.6T-parameter Switch-C model over T5-XXL when pre-trained on C4, while introducing stabilization techniques (auxiliary losses, capacity factors, bfloat16 training) to address instability.

Significance. If the empirical results hold, the work is significant for demonstrating that simple top-1 routing plus targeted stabilizations can scale MoE models to the trillion-parameter regime with practical speedups and competitive quality. The multi-scale experiments, multilingual results, and first reported bfloat16 training of such large sparse models provide concrete evidence that sparsity can be made more accessible, which has influenced subsequent large-model design.

major comments (1)
  1. [§4.3 and §4.4] Scaling experiments: The headline trillion-parameter results for Switch-C (1.6T) report aggregate quality and speedup but provide no isolated ablations that vary only the stabilization components (auxiliary loss, capacity factor, bfloat16) at that scale. All detailed ablations are shown on T5-Base/Large-derived models; without scale-specific controls it is impossible to confirm that the reported stability is due to the proposed techniques rather than unstated hyper-parameter retuning.
minor comments (2)
  1. [Abstract and §4] Speedup and quality numbers are presented without error bars or run-to-run variance, which weakens the 4x and 7x claims even though the trends are consistent across model sizes.
  2. [§3.2] The capacity-factor and auxiliary-loss formulations are described in prose; adding a compact equation or pseudocode block would improve reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive assessment of the work and for recommending minor revision. We address the single major comment below.

Point-by-point responses
  1. Referee: [§4.3 and §4.4] Scaling experiments: The headline trillion-parameter results for Switch-C (1.6T) report aggregate quality and speedup but provide no isolated ablations that vary only the stabilization components (auxiliary loss, capacity factor, bfloat16) at that scale. All detailed ablations are shown on T5-Base/Large-derived models; without scale-specific controls it is impossible to confirm that the reported stability is due to the proposed techniques rather than unstated hyper-parameter retuning.

    Authors: We agree that isolated ablations varying only the stabilization components at the full 1.6T scale would strengthen the evidence. However, training even one 1.6T model is extremely resource-intensive, and repeating the process for controlled ablations is not feasible. The auxiliary loss, capacity factor, and bfloat16 techniques were developed and validated through detailed experiments on smaller T5-Base/Large-derived models (as reported in §4.3), then applied to enable stable training of Switch-C. The fact that the 1.6T model trained successfully without divergence, using these exact techniques, provides supporting evidence of their utility at scale. We will revise §4.4 to explicitly acknowledge the absence of full-scale isolated ablations, clarify that the techniques generalize from smaller-scale validation, and note the computational constraints. This makes the claims more precise without overstating the evidence.

    revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical scaling results with no derivations reducing to inputs

Full rationale

The paper reports measured pre-training speedups, quality metrics, and stability observations from training Switch Transformer variants on C4 and multilingual data. No equations, uniqueness theorems, or first-principles derivations are invoked whose outputs are forced by construction from fitted parameters or self-citations. All reported gains are direct comparisons against dense T5 baselines at matched compute; stabilization techniques are presented as engineering choices validated by ablation tables rather than as predictions derived from the model itself. The central scaling claim (trillion-parameter models with 4x speedup) rests on experimental checkpoints, not on any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard deep-learning assumptions about gradient-based optimization and the existence of useful expert specialization; no new physical or mathematical axioms are introduced. The number of experts and the capacity factor are design choices rather than fitted parameters for the reported results.

axioms (1)
  • domain assumption: Gradient descent on a sparsely activated network will converge to a useful solution when combined with the described stabilization techniques.
    Invoked implicitly when claiming stable training of trillion-parameter models.

pith-pipeline@v0.9.0 · 5517 in / 1271 out tokens · 29689 ms · 2026-05-12T23:53:12.636306+00:00 · methodology

Forward citations

Cited by 32 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

    cs.LG 2026-05 unverdicted novelty 7.0

    Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.

  2. SDG-MoE: Signed Debate Graph Mixture-of-Experts

    cs.LG 2026-05 unverdicted novelty 7.0

    SDG-MoE adds learned support and critique graphs plus disagreement-gated message passing to MoE models, yielding 19.8% better validation perplexity than the strongest baseline in three-seed pretraining.

  3. SDG-MoE: Signed Debate Graph Mixture-of-Experts

    cs.LG 2026-05 unverdicted novelty 7.0

    SDG-MoE introduces learned signed interaction graphs and disagreement-gated deliberation among experts in MoE architectures, yielding 19.8% better validation perplexity than the strongest baseline.

  4. MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

    cs.LG 2026-05 conditional novelty 7.0

    MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.

  5. Boundary Mass and the Soft-to-Hard Limit in Mixture-of-Experts

    cs.LG 2026-05 unverdicted novelty 7.0

    Boundary mass in MoE is linear in slab width under smoothness and transversality, so the zero-temperature limit is governed by a thin geometric layer around routing interfaces rather than the full input space.

  6. Model Compression with Exact Budget Constraints via Riemannian Manifolds

    cs.LG 2026-05 unverdicted novelty 7.0

    The budget constraint in discrete model compression defines a Riemannian manifold allowing exact-constraint first-order optimization via Riemannian Constrained Optimization (RCO) without extra hyperparameters.

  7. Plasticity-Enhanced Multi-Agent Mixture of Experts for Dynamic Objective Adaptation in UAVs-Assisted Emergency Communication Networks

    cs.MA 2026-04 unverdicted novelty 7.0

    PE-MAMoE combines sparsely gated mixture-of-experts actors with a non-parametric phase controller in MAPPO to maintain plasticity under dynamic user mobility and traffic, yielding 26.3% higher normalized IQM return in...

  8. InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models

    cs.DC 2026-04 unverdicted novelty 7.0

    InfiniLoRA decouples LoRA execution from base-model inference and reports 3.05x higher request throughput plus 54% more adapters meeting strict latency SLOs.

  9. PLUME: Latent Reasoning Based Universal Multimodal Embedding

    cs.CV 2026-04 unverdicted novelty 7.0

    PLUME uses latent-state autoregressive rollouts and a progressive training curriculum to deliver efficient reasoning for universal multimodal embeddings without generating explicit rationales.

  10. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    cs.LG 2025-02 unverdicted novelty 7.0

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  11. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  12. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    cs.LG 2022-08 conditional novelty 7.0

    LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.

  13. OPT: Open Pre-trained Transformer Language Models

    cs.CL 2022-05 unverdicted novelty 7.0

    OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

  14. Quantifying Memorization Across Neural Language Models

    cs.LG 2022-02 unverdicted novelty 7.0

    Memorization in language models increases log-linearly with model capacity, data duplication count, and prompt context length.

  15. DeBERTa: Decoding-enhanced BERT with Disentangled Attention

    cs.CL 2020-06 unverdicted novelty 7.0

    DeBERTa improves BERT-style models by separating content and relative position in attention and adding absolute positions to the decoder, yielding consistent gains on NLU and NLG tasks and the first single-model super...

  16. Federation of Experts: Communication Efficient Distributed Inference for Large Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    FoE restructures MoE blocks into per-KV-head clusters with sum-based synchronization, removing all-to-all communication in single-node settings and limiting it to intra-node in multi-node settings for up to 5.2x faste...

  17. Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism

    cs.DC 2026-05 unverdicted novelty 6.0

    Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.

  18. InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition

    cs.CL 2026-05 unverdicted novelty 6.0

    InfoLaw models pretraining as information accumulation where quality sets information density and repetition causes scale-dependent diminishing returns, predicting loss with low error on unseen mixtures and larger sca...

  19. HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models

    cs.LG 2026-04 unverdicted novelty 6.0

    HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.

  20. Decoupled DiLoCo for Resilient Distributed Pre-training

    cs.CL 2026-04 unverdicted novelty 6.0

    Decoupled DiLoCo enables asynchronous distributed pre-training with zero global downtime under simulated failures while preserving competitive performance on text and vision tasks.

  21. Temporally Extended Mixture-of-Experts Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Temporally extended MoE layers using the option-critic framework with deliberation costs cut switching rates below 5% while retaining most capability on MATH, MMLU, and MMMLU.

  22. Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.

  23. Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

    cs.LG 2024-01 unverdicted novelty 6.0

    SPIN lets weak LLMs become strong by self-generating training data from previous model versions and training to prefer human-annotated responses over its own outputs, outperforming DPO even with extra GPT-4 data on be...

  24. DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    cs.LG 2023-09 accept novelty 6.0

    DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.

  25. Emergent Abilities of Large Language Models

    cs.CL 2022-06 unverdicted novelty 6.0

    Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.

  26. PaLM: Scaling Language Modeling with Pathways

    cs.CL 2022-04 accept novelty 6.0

    PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

  27. ST-MoE: Designing Stable and Transferable Sparse Expert Models

    cs.CL 2022-02 unverdicted novelty 6.0

    ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...

  28. Ethical and social risks of harm from Language Models

    cs.CL 2021-12 accept novelty 6.0

    The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job...

  29. Complexity Horizons of Compressed Models in Analog Circuit Analysis

    cs.AI 2026-05 unverdicted novelty 5.0

    Prerequisite graphs map compressed LLM performance boundaries in analog circuit analysis to allow selecting the smallest viable model for a given task complexity.

  30. Domain-Specialized Object Detection via Model-Level Mixtures of Experts

    cs.CV 2026-04 unverdicted novelty 5.0

    Model-level MoE of domain-specialized YOLO detectors with gating network outperforms standard ensembles on BDD100K while revealing expert specialization.

  31. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

    cs.CL 2024-01 unverdicted novelty 5.0

    DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.

  32. Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

    cs.CV 2026-05 unverdicted novelty 4.0

    Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.
