citation dossier
Adam: A method for stochastic optimization
2 Pith papers citing it
2 reference links
cs.CL · top field · 1 paper
ACCEPT · top verdict bucket · 1 paper
why this work matters in Pith
Pith has found this work in 2 reviewed papers. Its strongest current cluster is cs.CL (1 paper). The largest review-status bucket among citing papers is ACCEPT (1 paper). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.
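For reference, the work this dossier covers is the Adam optimizer. A minimal sketch of its update rule follows; the function name and list-based parameter handling are illustrative, not from any reference implementation, and the defaults mirror the paper's suggested hyperparameters:

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over flat parameter lists.

    m, v: exponential moving averages of the gradient and its square;
    t: 1-based step count, used for bias correction of the averages.
    """
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    # Bias-corrected moment estimates.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # Per-coordinate step scaled by the inverse root of the second moment.
    theta = [ti - lr * mh / (math.sqrt(vh) + eps)
             for ti, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v
```

For example, minimizing f(x) = x² from x = 1.0 with the gradient 2x moves the parameter steadily toward zero, with each early step close in magnitude to the learning rate.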
years: 2017
2 representative citing papers
Pith review generated a malformed one-line summary.
citing papers explorer
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
  A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.
- Attention Is All You Need
  Pith review generated a malformed one-line summary.
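The first explorer entry summarizes noisy top-k gating. A hedged sketch of how such a gate selects experts, under the assumption that each expert has a scalar logit, noise is Gaussian, and the softmax is taken only over the k survivors; names are illustrative, not from the paper's code:

```python
import math
import random

def noisy_top_k_gate(logits, k, noise_std=1.0, rng=random):
    """Return a gate weight per expert: add Gaussian noise to the logits,
    keep only the top-k noisy scores, and softmax over those survivors.
    Experts outside the top-k receive weight 0, so only k experts run."""
    noisy = [l + rng.gauss(0.0, noise_std) for l in logits]
    top = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k]
    exp = {i: math.exp(noisy[i]) for i in top}
    z = sum(exp.values())
    return [exp.get(i, 0.0) / z for i in range(len(noisy))]
```

The zero weights are what make the compute sub-linear in the number of experts: with k fixed, only k expert networks are evaluated per input regardless of how many experts exist.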