ST-MoE: Designing Stable and Transferable Sparse Expert Models
Pith reviewed 2026-05-12 23:09 UTC · model grok-4.3
The pith
A sparse mixture-of-experts model achieves state-of-the-art transfer learning performance for the first time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper shows that targeted modifications to router stability and capacity factors, together with adjusted training procedures, allow sparse expert models to train reliably and to achieve state-of-the-art transfer results across a broad suite of natural language tasks, including SuperGLUE, ARC, XSum, CNN-DM, WebQA, Natural Questions, Winogrande, and ANLI R3.
What carries the argument
Router stability techniques combined with tuned capacity factors that maintain balanced expert utilization and prevent training collapse during both pre-training and fine-tuning.
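The router z-loss at the center of this claim is defined in the paper as the mean squared log-sum-exp of the router logits. A minimal NumPy sketch, illustrative only and not the authors' code (the coefficient that weights it into the total loss is not stated in this excerpt):

```python
import numpy as np

def router_z_loss(logits):
    """Router z-loss from ST-MoE: penalizes large router logits.

    logits: array of shape (num_tokens, num_experts).
    Returns the mean over tokens of (log sum_j exp(logit_j))^2, which
    discourages the router from producing large, numerically unstable logits.
    """
    # Numerically stable log-sum-exp per token.
    m = logits.max(axis=-1, keepdims=True)
    log_z = m.squeeze(-1) + np.log(np.exp(logits - m).sum(axis=-1))
    return float(np.mean(log_z ** 2))

# Small logits incur a small penalty; scaling them up inflates it sharply.
small = np.array([[0.1, -0.2, 0.05], [0.0, 0.3, -0.1]])
print(router_z_loss(small * 10) > router_z_loss(small))  # True
```

Keeping log Z small bounds the magnitudes entering the router softmax, which is one plausible mechanism for the low-precision training stability the review discusses.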
If this is right
- Sparse models can be scaled to hundreds of billions of parameters while remaining trainable and transferable.
- Fine-tuning quality becomes consistent enough for production use across reasoning and summarization tasks.
- Inference cost drops relative to dense models of comparable capability because only a fraction of experts activate per token.
- Energy-efficient scaling paths open for language models without sacrificing benchmark leadership.
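For intuition on the capacity-factor lever behind these points: each expert processes at most a fixed number of tokens per batch, and tokens routed beyond that capacity are dropped. A minimal sketch, with function and parameter names that are illustrative rather than taken from the paper:

```python
import math

def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    """Tokens each expert may process before overflow tokens are dropped.

    With perfectly balanced routing each expert receives
    tokens_per_batch / num_experts tokens; the capacity factor adds
    headroom for imbalance at the cost of extra memory and compute.
    """
    return math.ceil(capacity_factor * tokens_per_batch / num_experts)

# Example: 4096 tokens routed across 64 experts with 25% headroom.
print(expert_capacity(4096, 64, 1.25))  # 80
```

A larger capacity factor drops fewer tokens but spends more compute per batch, which is why tuning it trades stability and quality against efficiency.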
Where Pith is reading between the lines
- The same stability fixes may allow even larger sparse models to be trained successfully beyond 269 billion parameters.
- The design principles could be tested on other sparse routing architectures to check whether they generalize.
- Adopting these procedures might reduce the practical barrier to deploying high-capacity models in resource-constrained settings.
Load-bearing premise
The observed stability and transfer gains come primarily from the described router and capacity choices rather than from unmentioned factors such as data selection or optimizer details.
What would settle it
A replication that applies the same router stability and capacity rules yet still encounters training collapse or fails to match the reported scores on SuperGLUE, XSum, or Natural Questions.
Original abstract
Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ST-MoE, a set of design choices for Mixture-of-Experts models aimed at improving training stability and transfer performance. Key elements include router z-loss, capacity factor scheduling, and auxiliary losses. The authors scale a sparse model to 269B parameters (ST-MoE-32B) whose training compute matches a 32B dense encoder-decoder Transformer and report that it achieves state-of-the-art results on a broad suite of transfer tasks: SuperGLUE, ARC Easy/Challenge, XSum, CNN-DM, WebQA, Natural Questions, Winogrande, and ANLI R3. The work is framed as a practical design guide for stable sparse models.
Significance. If the headline transfer results prove robust and the gains can be isolated to the proposed MoE-specific techniques, the paper would be significant: it would be the first demonstration that a sparse model can reach SOTA across diverse transfer benchmarks while retaining the inference efficiency of sparsity. The scaling result and the explicit design-guide framing also provide concrete, reusable guidance for practitioners.
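The auxiliary losses the summary refers to are not defined in this excerpt; the formulation below is the top-1 load-balancing loss popularized by Switch Transformer, offered as a plausible sketch of what such a loss looks like rather than as ST-MoE's exact objective:

```python
import numpy as np

def load_balance_loss(router_probs, expert_index):
    """Switch-Transformer-style auxiliary load-balancing loss (assumed form).

    router_probs: (num_tokens, num_experts) softmax outputs of the router.
    expert_index: (num_tokens,) expert chosen for each token (top-1 routing).
    Equals 1.0 under perfectly uniform routing and grows with imbalance.
    """
    num_tokens, num_experts = router_probs.shape
    # f_i: fraction of tokens dispatched to expert i.
    f = np.bincount(expert_index, minlength=num_experts) / num_tokens
    # P_i: mean router probability assigned to expert i.
    p = router_probs.mean(axis=0)
    return float(num_experts * np.dot(f, p))

# Perfectly balanced routing over 2 experts attains the minimum value 1.0.
probs = np.array([[0.5, 0.5], [0.5, 0.5]])
idx = np.array([0, 1])
print(load_balance_loss(probs, idx))  # 1.0
```

Because the loss is minimized only when dispatch fractions and router probabilities are uniform across experts, it pushes against the expert collapse that the stability discussion worries about.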
Major comments (2)
- [§4 and §5] §4 (Experiments) and §5 (Results): The central claim that the router z-loss, capacity-factor schedule, and auxiliary losses are the decisive factors enabling stable pretraining and SOTA transfer is not supported by a controlled comparison. No table or subsection holds the pretraining corpus, data mixture, and optimizer schedule fixed while toggling only the MoE components against an otherwise identical dense baseline. Without this isolation the attribution of stability and transfer gains remains confounded.
- [Table 1 and Table 2] Table 1 and Table 2: Reported scores for ST-MoE-32B on SuperGLUE, ARC, and summarization tasks lack error bars, standard deviations, or results from multiple random seeds. Given the known sensitivity of large-model fine-tuning, single-run numbers are insufficient to substantiate the “state-of-the-art” claim or to allow readers to assess whether the reported margins are reliable.
Minor comments (2)
- [§3.2] §3.2: The definition of the router z-loss is clear, but the text does not state the exact coefficient used in the final runs; adding this hyper-parameter value would improve reproducibility.
- [Figure 4] Figure 4: The capacity-factor scheduling plot would benefit from an explicit legend indicating which curve corresponds to the final ST-MoE-32B configuration.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The comments help clarify the scope of our claims and the evidence needed to support them. We address each major comment below and indicate the revisions we will make.
Point-by-point responses
Referee: [§4 and §5] §4 (Experiments) and §5 (Results): The central claim that the router z-loss, capacity-factor schedule, and auxiliary losses are the decisive factors enabling stable pretraining and SOTA transfer is not supported by a controlled comparison. No table or subsection holds the pretraining corpus, data mixture, and optimizer schedule fixed while toggling only the MoE components against an otherwise identical dense baseline. Without this isolation the attribution of stability and transfer gains remains confounded.
Authors: We agree that a fully isolated ablation—holding the exact pretraining corpus, data mixture, and optimizer schedule fixed while comparing only the addition of our MoE-specific techniques against an otherwise identical dense model—would provide the cleanest attribution. Our Section 4 ablations do isolate the effect of each individual technique (router z-loss, capacity-factor scheduling, auxiliary losses) on stability and downstream metrics while keeping the rest of the MoE architecture fixed, but these are performed within the sparse setting rather than against a matched dense baseline. The primary comparisons in Section 5 are to published dense models of comparable training compute. In the revised manuscript we will add an explicit limitations paragraph in Section 5 acknowledging that the reported gains are those of the full ST-MoE recipe versus published dense baselines, and we will clarify that the individual technique ablations demonstrate necessity within the sparse regime but do not constitute a controlled dense-versus-sparse experiment. Revision: partial.
Referee: [Table 1 and Table 2] Table 1 and Table 2: Reported scores for ST-MoE-32B on SuperGLUE, ARC, and summarization tasks lack error bars, standard deviations, or results from multiple random seeds. Given the known sensitivity of large-model fine-tuning, single-run numbers are insufficient to substantiate the “state-of-the-art” claim or to allow readers to assess whether the reported margins are reliable.
Authors: We acknowledge that single-run fine-tuning results at this scale limit the ability to quantify statistical reliability. Training and evaluating the 269B-parameter model multiple times is computationally prohibitive. In the revised version we will (1) add a short discussion in Section 5 noting this limitation and referencing the variance observed across random seeds in our smaller-scale ablations (reported in the appendix), and (2) qualify the “state-of-the-art” language to “competitive with or exceeding prior published single-run results” where appropriate. We will not be able to add error bars from multiple full-scale runs. Revision: partial.
Circularity Check
Empirical design guide with no derivation chain or self-referential predictions
Full rationale
The paper is an empirical contribution focused on training instabilities and fine-tuning quality in large MoE models. It reports scaling results to 269B parameters and SOTA transfer performance on external benchmarks (SuperGLUE, ARC, XSum, etc.). No mathematical derivations, equations, fitted parameters presented as predictions, or load-bearing self-citations appear in the provided text. Central claims are framed as experimental outcomes of design choices rather than quantities that reduce to inputs by construction. The work is self-contained against external benchmarks and does not invoke uniqueness theorems, ansatzes smuggled via citation, or renamings of known results. This matches the default expectation of no significant circularity.
Forward citations
Cited by 27 Pith papers
- Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts
  Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.
- Mixture of Layers with Hybrid Attention
  Mixture of Layers replaces monolithic transformer blocks with routed thin parallel blocks using hybrid attention that combines a shared softmax block for global context with Gated DeltaNet linear attention in the rout...
- SDG-MoE: Signed Debate Graph Mixture-of-Experts
  SDG-MoE introduces learned signed interaction graphs and disagreement-gated deliberation among experts in MoE architectures, yielding 19.8% better validation perplexity than the strongest baseline.
- SDG-MoE: Signed Debate Graph Mixture-of-Experts
  SDG-MoE adds learned support and critique graphs plus disagreement-gated message passing to MoE models, yielding 19.8% better validation perplexity than the strongest baseline in three-seed pretraining.
- When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models
  Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
- Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts
  Adding temporal memory via LIF, precision-weighted gating, and anticipatory prediction to MoE routers recovers effective expert selection at distribution transitions, with ablation confirming a super-additive beta-ant...
- Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality
  Routing topology in sparse Mixture-of-Experts models does not determine asymptotic language modeling perplexity; multiple variants including cosine-similarity routing achieve statistically equivalent performance.
- Jamba: A Hybrid Transformer-Mamba Language Model
  Jamba presents a hybrid Transformer-Mamba MoE architecture for LLMs that delivers state-of-the-art benchmark performance and strong results up to 256K token contexts while fitting in one 80GB GPU with high throughput.
- Sparse Layers are Critical to Scaling Looped Language Models
  Looped MoE models scale better than standard transformers because different experts activate on each loop pass, recovering expressivity without extra parameters, and support superior early exits.
- Hierarchical Mixture-of-Experts with Two-Stage Optimization
  Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and v...
- UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
  A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
- Cumulative-Goodness Free-Riding in Forward-Forward Networks: Real, Repairable, but Not Accuracy-Dominant
  Cumulative-goodness Forward-Forward networks exhibit layer free-riding where discrimination gradients decay exponentially with prior positive margins; per-block, hardness-gated, and depth-scaled remedies yield 4-45x b...
- ZeRO-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
  ZeRO-Prefill achieves 1.35-1.59x higher throughput for MoE prefill serving by replacing per-layer activation AllToAll with overlapped asynchronous weight AllGather and prefix-aware routing.
- Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns
  Profiling shows persistent expert load imbalance and domain-specific activation patterns in large MoE models; workload-aware grouping and placement reduce all-to-all communication volume by up to 20x.
- Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
  Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.
- Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
  Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.
- Emergent Abilities of Large Language Models
  Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
- PaLM: Scaling Language Modeling with Pathways
  PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
- Teacher-Guided Routing for Sparse Vision Mixture-of-Experts
  Teacher-guided routing supplies pseudo-supervision from a dense model's intermediate features to stabilize expert selection in sparse vision MoE models.
- Revisiting Auxiliary Losses for Conditional Depth Routing: An Empirical Study
  Removing utility regression and rank supervision auxiliary losses improves language modeling performance and training efficiency for conditional depth routing gates, and eliminates the advantage of a more complex JEPA...
- Does a Global Perspective Help Prune Sparse MoEs Elegantly?
  GRAPE is a global redundancy-aware pruning strategy for sparse MoEs that dynamically allocates pruning budgets across layers and improves average accuracy by 1.40% over the best local baseline across tested models and...
- OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment
  OneRec unifies retrieval and ranking in a generative recommender using session-wise decoding and iterative DPO-based preference alignment, achieving real-world gains on Kuaishou.
- PaLM 2 Technical Report
  PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
- EMO: Frustratingly Easy Progressive Training of Extendable MoE
  EMO progressively expands the expert pool in MoE models using scaling-law-derived token budgets per stage, matching fixed-expert performance while cutting wall-clock time and GPU cost.
- Token Economics for LLM Agents: A Dual-View Study from Computing and Economics
  The paper delivers a unified survey of token economics for LLM agents, conceptualizing tokens as production factors, exchange mediums, and units of account across micro, meso, macro, and security dimensions using esta...
- Joint-Centric Dual Contrastive Alignment with Structure-Preserving and Information-Balanced Regularization
  HILBERT uses joint-centric dual contrastive learning with CKA and mutual information regularizers to align long-sequence audio-text embeddings while preserving structure and balancing modalities.
- Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models
  Gemma-4-E4B with few-shot chain-of-thought reaches the highest weighted accuracy of 0.675 at 14.9 GB VRAM, while the larger Gemma-4-26B-A4B MoE model scores 0.663 but uses 48.1 GB.