A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.
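The noisy top-k gating named in this summary can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the weight names, dimensions, and single-example shape are all assumptions made for clarity. The gate adds learned input-dependent noise to the expert logits, keeps only the k largest, and renormalizes them with a softmax so the layer computes just k experts per input.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_top_k_gate(x, w_gate, w_noise, k):
    """Noisy top-k gating (illustrative sketch): perturb the gate logits
    with learned noise, keep the k largest, softmax them, zero the rest."""
    clean = x @ w_gate                           # raw gate logits, one per expert
    noise_std = np.log1p(np.exp(x @ w_noise))    # softplus keeps the noise std positive
    noisy = clean + rng.standard_normal(clean.shape) * noise_std
    top_k = np.argsort(noisy)[-k:]               # indices of the k largest noisy logits
    gates = np.zeros_like(noisy)
    shifted = np.exp(noisy[top_k] - noisy[top_k].max())
    gates[top_k] = shifted / shifted.sum()       # softmax over only the kept experts
    return gates                                 # sparse mixture weights summing to 1

# Hypothetical sizes for the sketch: 8-dim input, 4 experts, 2 active per input.
d_model, num_experts, k = 8, 4, 2
x = rng.standard_normal(d_model)
w_gate = rng.standard_normal((d_model, num_experts))
w_noise = rng.standard_normal((d_model, num_experts))
g = noisy_top_k_gate(x, w_gate, w_noise, k)
```

Because only k of the gate values are nonzero, only those experts' forward passes need to run, which is what gives the layer its sub-linear compute relative to total parameter count.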
Adam: A Method for Stochastic Optimization
2 Pith papers cite this work. Polarity classification is still indexing.
2017 · 2 representative citing papers
Pith review generated a malformed one-line summary.
Citing papers explorer
- Attention Is All You Need
Pith review generated a malformed one-line summary.