Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.
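To make the gating mechanism concrete, here is a minimal NumPy sketch of noisy top-k gating for a single token, following the paper's formulation G(x) = Softmax(KeepTopK(H(x), k)). The weight names (w_gate, w_noise) and all shapes are illustrative assumptions, not the authors' code.

```python
import numpy as np

def noisy_top_k_gating(x, w_gate, w_noise, k, rng):
    """Sketch of noisy top-k gating: each expert's logit gets tunable
    Gaussian noise, only the top k logits are kept, and the rest are
    masked to -inf before the softmax, so the gate output is sparse."""
    clean_logits = x @ w_gate                      # one logit per expert
    noise_stddev = np.log1p(np.exp(x @ w_noise))   # softplus keeps stddev positive
    noisy = clean_logits + rng.standard_normal(clean_logits.shape) * noise_stddev
    # Keep the k largest logits; masked experts get exactly zero weight.
    threshold = np.sort(noisy)[-k]
    masked = np.where(noisy >= threshold, noisy, -np.inf)
    exp = np.exp(masked - masked.max())
    return exp / exp.sum()                         # sparse gates, sums to 1

# Example: route one 16-dim token among 8 experts, activating only 2.
rng = np.random.default_rng(0)
d, num_experts, k = 16, 8, 2
x = rng.standard_normal(d)
gates = noisy_top_k_gating(x, rng.standard_normal((d, num_experts)),
                           rng.standard_normal((d, num_experts)), k, rng)
print(np.nonzero(gates)[0])  # indices of the k active experts
```

Because only k experts receive nonzero gate values, only those experts are evaluated, which is what makes compute grow sub-linearly in the total parameter count.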
3 Pith papers cite this work.
Representative citing papers
GNMT deploys 8-layer LSTMs with attention, wordpieces, low-precision inference, and coverage-penalized beam search to match state-of-the-art on WMT'14 En-Fr and En-De while cutting translation errors by 60% in human evaluations.
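For the coverage-penalized beam search mentioned above, the sketch below shows GNMT-style candidate rescoring: a length penalty normalizes the log-probability so short outputs are not unfairly favored, and a coverage penalty docks candidates whose attention leaves source words uncovered. Parameter names and the default alpha/beta values are illustrative assumptions, not the production configuration.

```python
import numpy as np

def gnmt_rescore(log_prob, target_len, attention, alpha=0.6, beta=0.2):
    """Length-normalized, coverage-penalized beam score, GNMT-style.

    log_prob   : total log P(Y|X) of the candidate translation
    target_len : |Y|, number of target tokens
    attention  : (target_len, source_len) attention weights p_{i,j}
    """
    # Length penalty: divides out the bias toward shorter hypotheses.
    lp = ((5.0 + target_len) ** alpha) / ((5.0 + 1.0) ** alpha)
    # Coverage penalty: sum attention over target positions for each
    # source word; words attended to less than once are penalized.
    coverage = np.minimum(attention.sum(axis=0), 1.0)
    cp = beta * np.log(np.maximum(coverage, 1e-9)).sum()  # clip avoids log(0)
    return log_prob / lp + cp
```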