Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Pith reviewed 2026-05-12 23:53 UTC · model grok-4.3
The pith
Switch Transformers scale language models to a trillion parameters with constant compute and 4x pre-training speedup over T5-XXL.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A simplified top-1 gating function in each Switch layer routes every token to its single highest-scoring expert. An auxiliary load-balancing loss keeps expert usage even, and an expert capacity factor bounds how many tokens each expert may process, which limits overflow rather than eliminating it. Because only one expert runs per token, the total parameter count can grow while the floating-point operations per token stay roughly fixed, enabling stable pre-training of trillion-parameter models that reach the same quality as T5-XXL four times faster.
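The routing step and the balancing loss can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `top1_route` and `load_balancing_loss`, the array shapes, and the default `alpha=0.01` are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def top1_route(tokens, router_weights):
    """Pick one expert per token.

    tokens: [num_tokens, d_model]; router_weights: [d_model, num_experts].
    Returns the chosen expert index, its gate value, and the full router
    probabilities for every token.
    """
    probs = softmax(tokens @ router_weights)
    expert_index = probs.argmax(axis=-1)
    gate = probs[np.arange(len(tokens)), expert_index]
    return expert_index, gate, probs

def load_balancing_loss(expert_index, probs, alpha=0.01):
    """alpha * N * sum_i f_i * P_i (alpha is a hypothetical default here).

    f_i: fraction of tokens dispatched to expert i.
    P_i: mean router probability assigned to expert i.
    """
    num_experts = probs.shape[1]
    f = np.bincount(expert_index, minlength=num_experts) / len(expert_index)
    P = probs.mean(axis=0)
    return alpha * num_experts * np.sum(f * P)
```

With perfectly even routing the loss equals `alpha`; if every token goes to one expert with probability 1, it rises to `alpha * num_experts`, which is the pressure that discourages collapse.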
What carries the argument
The Switch layer, which replaces a standard feed-forward network with a mixture of experts gated by a simple top-1 router that assigns each token to a single expert.
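To make the layer concrete, the sketch below shows one way a Switch-style layer could stand in for a dense feed-forward block. It is an illustration under stated assumptions: `switch_layer` and `make_ffn` are hypothetical names, the experts are small ReLU feed-forward networks, the capacity is rounded up with `ceil`, and tokens over capacity simply produce a zero update so that the surrounding residual connection carries them through.

```python
import numpy as np

def make_ffn(rng, d_model, d_ff):
    # Hypothetical expert: a two-matrix ReLU feed-forward network.
    w_in = rng.normal(scale=0.02, size=(d_model, d_ff))
    w_out = rng.normal(scale=0.02, size=(d_ff, d_model))
    return lambda x: np.maximum(x @ w_in, 0.0) @ w_out

def switch_layer(tokens, router_w, experts, capacity_factor=1.25):
    """tokens: [T, d]; router_w: [d, N]; experts: N callables [n, d] -> [n, d]."""
    num_tokens = tokens.shape[0]
    num_experts = len(experts)

    # Router: softmax over experts, then keep only the top-1 choice.
    logits = tokens @ router_w
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    expert_index = probs.argmax(axis=-1)
    gate = probs.max(axis=-1)

    # Each expert processes at most `capacity` tokens (ceil is an assumption).
    capacity = int(np.ceil(num_tokens / num_experts * capacity_factor))

    output = np.zeros_like(tokens)
    dropped = np.zeros(num_tokens, dtype=bool)
    for e, expert in enumerate(experts):
        routed = np.flatnonzero(expert_index == e)
        kept, overflow = routed[:capacity], routed[capacity:]
        dropped[overflow] = True  # overflow tokens get no expert update
        if kept.size:
            output[kept] = gate[kept][:, None] * expert(tokens[kept])
    return output, dropped
```

A caller would typically compute `hidden = tokens + output`, so dropped tokens pass through the layer unchanged instead of being lost.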
If this is right
- Pre-training runs up to 7 times faster than T5-Base and T5-Large at identical compute.
- Trillion-parameter models achieve 4 times the pre-training speed of T5-XXL.
- Multilingual pre-training improves over mT5-Base on every one of the 101 languages.
- Models train successfully in bfloat16, lowering memory use without quality loss.
Where Pith is reading between the lines
- The constant-compute property suggests that further increases in expert count could produce still-larger effective capacity on the same hardware.
- The same routing simplification may reduce instability in other sparse architectures outside language modeling.
- Constant FLOPs per token could make very large models feasible under fixed inference budgets.
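The constant-compute argument can be checked with simple arithmetic. The sketch below counts parameters and per-token multiply-accumulates for a single Switch feed-forward layer; the function names and the dimensions used in the usage note are illustrative assumptions, biases are ignored, and attention and embedding costs are left out.

```python
def ffn_params(d_model, d_ff):
    # Two weight matrices (d_model x d_ff and d_ff x d_model), no biases.
    return 2 * d_model * d_ff

def switch_ffn_stats(d_model, d_ff, num_experts):
    """Return (total parameters, multiply-accumulates per token) for one layer."""
    expert_params = ffn_params(d_model, d_ff)
    router_params = d_model * num_experts
    total_params = num_experts * expert_params + router_params
    # Each token runs the router projection plus exactly one expert.
    macs_per_token = expert_params + router_params
    return total_params, macs_per_token
```

For example, comparing `switch_ffn_stats(768, 3072, 1)` with `switch_ffn_stats(768, 3072, 128)` shows the parameter count growing 128-fold while per-token compute grows only by the small router term, so "constant FLOPs" is a close approximation rather than an exact identity.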
Load-bearing premise
The simplified top-1 routing and stabilization techniques continue to produce stable training and competitive quality when both the number of experts and total model size are increased far beyond the scales tested.
What would settle it
Training divergence, expert collapse, or failure to match dense-model quality when the number of experts exceeds a few thousand or total parameters exceed a few trillion would falsify the claim.
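Expert collapse is straightforward to monitor during training. The following sketch is one possible diagnostic, not something the paper prescribes: `expert_load_report` is a hypothetical helper that summarizes how evenly tokens are spread across experts and what fraction would exceed a given capacity. It assumes at least two experts.

```python
import numpy as np

def expert_load_report(expert_index, num_experts, capacity):
    """Summarize routing balance for one batch (requires num_experts > 1)."""
    counts = np.bincount(expert_index, minlength=num_experts)
    load = counts / counts.sum()
    nonzero = load[load > 0]
    entropy = -np.sum(nonzero * np.log(nonzero))
    overflow = np.maximum(counts - capacity, 0).sum()
    return {
        # 1.0 means perfectly even routing, 0.0 means a single expert gets everything.
        "normalized_entropy": float(entropy / np.log(num_experts)),
        "unused_experts": int(np.sum(counts == 0)),
        "dropped_fraction": float(overflow / counts.sum()),
    }
```

A normalized entropy drifting toward zero, or a rising count of unused experts as the expert count grows, would be the kind of evidence the falsification test above asks for.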
Original abstract
In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability -- we address these with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the "Colossal Clean Crawled Corpus" and achieve a 4x speedup over the T5-XXL model.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Switch Transformer, a simplified Mixture-of-Experts architecture using top-1 routing that enables sparsely activated models with up to a trillion parameters at constant compute cost. It reports up to 7x pre-training speedups over dense T5-Base/Large models, multilingual gains across 101 languages, and a 4x speedup for a 1.6T-parameter Switch-C model over T5-XXL when pre-trained on C4, while introducing stabilization techniques (auxiliary losses, capacity factors, bfloat16 training) to address instability.
Significance. If the empirical results hold, the work is significant for demonstrating that simple top-1 routing plus targeted stabilizations can scale MoE models to the trillion-parameter regime with practical speedups and competitive quality. The multi-scale experiments, multilingual results, and first reported bfloat16 training of such large sparse models provide concrete evidence that sparsity can be made more accessible, which has influenced subsequent large-model design.
major comments (1)
- [§4.3 and §4.4] (scaling experiments): The headline trillion-parameter results for Switch-C (1.6T) report aggregate quality and speedup but provide no isolated ablations that vary only the stabilization components (auxiliary loss, capacity factor, bfloat16) at that scale. All detailed ablations are shown on T5-Base/Large-derived models; without scale-specific controls it is impossible to confirm that the reported stability is due to the proposed techniques rather than unstated hyper-parameter retuning.
minor comments (2)
- [Abstract and §4]: Speedup and quality numbers are presented without error bars or run-to-run variance, which weakens the 4x and 7x claims even though the trends are consistent across model sizes.
- [§3.2]: The capacity-factor and auxiliary-loss formulations are described in prose; adding a compact equation or pseudocode block would improve reproducibility.
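For reference, the two formulations named in the last comment can be written compactly as follows. The notation is a sketch: T is the number of tokens in a batch B, N the number of experts, W_r the router weights, and alpha a scalar coefficient (the paper reports using 10^-2); the symbol C for capacity is shorthand introduced here.

```latex
\begin{align}
  C &= \frac{T}{N} \times \text{capacity factor}
      && \text{(tokens each expert may process)} \\
  p_i(x) &= \frac{\exp\big((W_r x)_i\big)}{\sum_{j=1}^{N} \exp\big((W_r x)_j\big)}
      && \text{(router probability for expert } i\text{)} \\
  f_i &= \frac{1}{T} \sum_{x \in B} \mathbb{1}\{\operatorname{argmax}_j \, p_j(x) = i\}
      && \text{(fraction of tokens sent to expert } i\text{)} \\
  P_i &= \frac{1}{T} \sum_{x \in B} p_i(x)
      && \text{(mean router probability for expert } i\text{)} \\
  \mathcal{L}_{\text{aux}} &= \alpha \, N \sum_{i=1}^{N} f_i \, P_i
      && \text{(load-balancing loss)}
\end{align}
```

Under uniform routing both f_i and P_i equal 1/N, so the auxiliary loss reduces to alpha; any concentration of tokens on a few experts pushes it higher.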
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the work and for recommending minor revision. We address the single major comment below.
Point-by-point responses
- Referee: [§4.3 and §4.4] (scaling experiments): The headline trillion-parameter results for Switch-C (1.6T) report aggregate quality and speedup but provide no isolated ablations that vary only the stabilization components (auxiliary loss, capacity factor, bfloat16) at that scale. All detailed ablations are shown on T5-Base/Large-derived models; without scale-specific controls it is impossible to confirm that the reported stability is due to the proposed techniques rather than unstated hyper-parameter retuning.
  Authors: We agree that isolated ablations varying only the stabilization components at the full 1.6T scale would strengthen the evidence. However, training even one 1.6T model is extremely resource-intensive, and repeating the process for controlled ablations is not feasible. The auxiliary loss, capacity factor, and bfloat16 techniques were developed and validated through detailed experiments on smaller T5-Base/Large-derived models (as reported in §4.3), then applied to enable stable training of Switch-C. The fact that the 1.6T model trained successfully without divergence, using these exact techniques, provides supporting evidence of their utility at scale. We will revise §4.4 to explicitly acknowledge the absence of full-scale isolated ablations, clarify that the techniques generalize from smaller-scale validation, and note the computational constraints. This makes the claims more precise without overstating the evidence.
  Revision: partial
Circularity Check
No circularity: purely empirical scaling results with no derivations reducing to inputs
Full rationale
The paper reports measured pre-training speedups, quality metrics, and stability observations from training Switch Transformer variants on C4 and multilingual data. No equations, uniqueness theorems, or first-principles derivations are invoked whose outputs are forced by construction from fitted parameters or self-citations. All reported gains are direct comparisons against dense T5 baselines at matched compute; stabilization techniques are presented as engineering choices validated by ablation tables rather than as predictions derived from the model itself. The central scaling claim (trillion-parameter models with 4x speedup) rests on experimental checkpoints, not on any self-referential reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Gradient descent on a sparsely activated network will converge to a useful solution when combined with the described stabilization techniques.
Forward citations
Cited by 32 Pith papers
- Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts
  Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.
- SDG-MoE: Signed Debate Graph Mixture-of-Experts
  SDG-MoE adds learned support and critique graphs plus disagreement-gated message passing to MoE models, yielding 19.8% better validation perplexity than the strongest baseline in three-seed pretraining.
- SDG-MoE: Signed Debate Graph Mixture-of-Experts
  SDG-MoE introduces learned signed interaction graphs and disagreement-gated deliberation among experts in MoE architectures, yielding 19.8% better validation perplexity than the strongest baseline.
- MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
  MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
- Boundary Mass and the Soft-to-Hard Limit in Mixture-of-Experts
  Boundary mass in MoE is linear in slab width under smoothness and transversality, so the zero-temperature limit is governed by a thin geometric layer around routing interfaces rather than the full input space.
- Model Compression with Exact Budget Constraints via Riemannian Manifolds
  The budget constraint in discrete model compression defines a Riemannian manifold allowing exact-constraint first-order optimization via Riemannian Constrained Optimization (RCO) without extra hyperparameters.
- Plasticity-Enhanced Multi-Agent Mixture of Experts for Dynamic Objective Adaptation in UAVs-Assisted Emergency Communication Networks
  PE-MAMoE combines sparsely gated mixture-of-experts actors with a non-parametric phase controller in MAPPO to maintain plasticity under dynamic user mobility and traffic, yielding 26.3% higher normalized IQM return in...
- InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models
  InfiniLoRA decouples LoRA execution from base-model inference and reports 3.05x higher request throughput plus 54% more adapters meeting strict latency SLOs.
- PLUME: Latent Reasoning Based Universal Multimodal Embedding
  PLUME uses latent-state autoregressive rollouts and a progressive training curriculum to deliver efficient reasoning for universal multimodal embeddings without generating explicit rationales.
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
  A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
  DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
  LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
- OPT: Open Pre-trained Transformer Language Models
  OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
- Quantifying Memorization Across Neural Language Models
  Memorization in language models increases log-linearly with model capacity, data duplication count, and prompt context length.
- DeBERTa: Decoding-enhanced BERT with Disentangled Attention
  DeBERTa improves BERT-style models by separating content and relative position in attention and adding absolute positions to the decoder, yielding consistent gains on NLU and NLG tasks and the first single-model super...
- Federation of Experts: Communication Efficient Distributed Inference for Large Language Models
  FoE restructures MoE blocks into per-KV-head clusters with sum-based synchronization, removing all-to-all communication in single-node settings and limiting it to intra-node in multi-node settings for up to 5.2x faste...
- Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism
  Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.
- InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition
  InfoLaw models pretraining as information accumulation where quality sets information density and repetition causes scale-dependent diminishing returns, predicting loss with low error on unseen mixtures and larger sca...
- HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models
  HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.
- Decoupled DiLoCo for Resilient Distributed Pre-training
  Decoupled DiLoCo enables asynchronous distributed pre-training with zero global downtime under simulated failures while preserving competitive performance on text and vision tasks.
- Temporally Extended Mixture-of-Experts Models
  Temporally extended MoE layers using the option-critic framework with deliberation costs cut switching rates below 5% while retaining most capability on MATH, MMLU, and MMMLU.
- Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts
  BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
  SPIN lets weak LLMs become strong by self-generating training data from previous model versions and training to prefer human-annotated responses over its own outputs, outperforming DPO even with extra GPT-4 data on be...
- DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
  DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.
- Emergent Abilities of Large Language Models
  Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
- PaLM: Scaling Language Modeling with Pathways
  PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
- ST-MoE: Designing Stable and Transferable Sparse Expert Models
  ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...
- Ethical and social risks of harm from Language Models
  The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job...
- Complexity Horizons of Compressed Models in Analog Circuit Analysis
  Prerequisite graphs map compressed LLM performance boundaries in analog circuit analysis to allow selecting the smallest viable model for a given task complexity.
- Domain-Specialized Object Detection via Model-Level Mixtures of Experts
  Model-level MoE of domain-specialized YOLO detectors with gating network outperforms standard ensembles on BDD100K while revealing expert specialization.
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
  DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.
- Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
  Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.