{"work":{"id":"2c6b3f6d-54e4-4df7-baa7-475a490799af","openalex_id":null,"doi":null,"arxiv_id":"1701.06538","raw_key":null,"title":"Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer","authors":null,"authors_text":"Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton","year":2017,"venue":"cs.LG","abstract":"The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.","external_url":"https://arxiv.org/abs/1701.06538","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-15T00:38:23.065034+00:00","pith_arxiv_id":"1701.06538","created_at":"2026-05-08T17:13:38.657778+00:00","updated_at":"2026-05-15T00:38:23.065034+00:00","title_quality_ok":true,"display_title":"Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer","render_title":"Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer"},"hub":{"state":{"work_id":"2c6b3f6d-54e4-4df7-baa7-475a490799af","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":124,"external_cited_by_count":null,"distinct_field_count":18,"first_pith_cited_at":"2017-06-12T17:57:34+00:00","last_pith_cited_at":"2026-05-13T16:48:24+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-15T01:56:18.660396+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":1}],"polarity_counts":[{"context_polarity":"background","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer","claims":[{"claim_text":"The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. 
In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational effic","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T01:03:57.813365+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"d555cf92-dbb8-4cd6-b9cc-3a82d85183de","orcid":null,"display_name":"Noam Shazeer"},{"id":"0fbbd93e-d17b-4e8d-95b5-28c0acb5edbc","orcid":null,"display_name":"Azalia Mirhoseini"},{"id":"21a391cf-e23e-4e65-bd19-04fe56190765","orcid":null,"display_name":"Krzysztof Maziarz"},{"id":"6d44b008-7d32-4fa4-ae41-64cc3944ae4b","orcid":null,"display_name":"Andy Davis"},{"id":"b7d93d8f-b3b6-4571-b881-819e8f1bae2d","orcid":null,"display_name":"Quoc Le"},{"id":"8b2cc23b-b0d5-44c1-9db1-00fbb88c798c","orcid":null,"display_name":"Geoffrey Hinton"}]},"error":null,"updated_at":"2026-05-14T01:03:58.455624+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T01:04:03.082839+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding","work_id":"52b3c9a6-2a27-45a7-ba2b-ebe4b5bb5a5f","shared_citers":32},{"title":"Mixtral of Experts","work_id":"0de8c352-9daa-4e1e-8c7b-3d0dec69f369","shared_citers":30},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":27},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":26},{"title":"DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models","work_id":"a9888d6d-bf47-4324-9834-7cc12ac3a78c","shared_citers":19},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":19},{"title":"Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity","work_id":"f43c4955-a965-4897-a11b-c4b25d2aeaa8","shared_citers":13},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":13},{"title":"Think you have Solved Question Answering? 
Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":13},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":12},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":12},{"title":"gpt-oss-120b & gpt-oss-20b Model Card","work_id":"178c1f7e-4f19-4392-a45d-45a6dfa88ead","shared_citers":11},{"title":"Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism","work_id":"c888e6d1-0b1d-43d6-9ef5-f0912a0efa1b","shared_citers":11},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":10},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":10},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":10},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":10},{"title":"ST-MoE: Designing Stable and Transferable Sparse Expert Models","work_id":"b7581741-3f43-4528-a7d0-3af9e51a4d9f","shared_citers":10},{"title":"DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model","work_id":"1e1df141-cac8-47fd-b068-c4c96e51e331","shared_citers":9},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":8},{"title":"arXiv preprint arXiv:2408.15664 , year=","work_id":"267500ca-1512-478f-8a1b-6ecbdb09771d","shared_citers":8},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":8},{"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":8},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":8}],"time_series":[{"n":1,"year":2017},{"n":1,"year":2019},{"n":2,"year":2020},{"n":1,"year":2021},{"n":1,"year":2022},{"n":2,"year":2023},{"n":4,"year":2024},{"n":1,"year":2025},{"n":103,"year":2026}]},"error":null,"updated_at":"2026-05-14T01:04:03.142126+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T01:03:57.807565+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer","claims":[{"claim_text":"The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. 
In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational effic","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T01:03:55.287417+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer","claims":[{"claim_text":"The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational effic","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T01:03:58.458360+00:00"}},"summary":{"title":"Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer","claims":[{"claim_text":"The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. 
In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational effic","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding","work_id":"52b3c9a6-2a27-45a7-ba2b-ebe4b5bb5a5f","shared_citers":32},{"title":"Mixtral of Experts","work_id":"0de8c352-9daa-4e1e-8c7b-3d0dec69f369","shared_citers":30},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":27},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":26},{"title":"DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models","work_id":"a9888d6d-bf47-4324-9834-7cc12ac3a78c","shared_citers":19},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":19},{"title":"Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity","work_id":"f43c4955-a965-4897-a11b-c4b25d2aeaa8","shared_citers":13},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":13},{"title":"Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":13},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":12},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":12},{"title":"gpt-oss-120b & gpt-oss-20b Model Card","work_id":"178c1f7e-4f19-4392-a45d-45a6dfa88ead","shared_citers":11},{"title":"Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism","work_id":"c888e6d1-0b1d-43d6-9ef5-f0912a0efa1b","shared_citers":11},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":10},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":10},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":10},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":10},{"title":"ST-MoE: Designing Stable and Transferable Sparse Expert Models","work_id":"b7581741-3f43-4528-a7d0-3af9e51a4d9f","shared_citers":10},{"title":"DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model","work_id":"1e1df141-cac8-47fd-b068-c4c96e51e331","shared_citers":9},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":8},{"title":"arXiv preprint arXiv:2408.15664 , year=","work_id":"267500ca-1512-478f-8a1b-6ecbdb09771d","shared_citers":8},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":8},{"title":"Measuring Mathematical Problem Solving With the MATH 
Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":8},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":8}],"time_series":[{"n":1,"year":2017},{"n":1,"year":2019},{"n":2,"year":2020},{"n":1,"year":2021},{"n":1,"year":2022},{"n":2,"year":2023},{"n":4,"year":2024},{"n":1,"year":2025},{"n":103,"year":2026}]},"authors":[{"id":"6d44b008-7d32-4fa4-ae41-64cc3944ae4b","orcid":null,"display_name":"Andy Davis","source":"manual","import_confidence":0.72},{"id":"0fbbd93e-d17b-4e8d-95b5-28c0acb5edbc","orcid":null,"display_name":"Azalia Mirhoseini","source":"manual","import_confidence":0.72},{"id":"8b2cc23b-b0d5-44c1-9db1-00fbb88c798c","orcid":null,"display_name":"Geoffrey Hinton","source":"manual","import_confidence":0.72},{"id":"21a391cf-e23e-4e65-bd19-04fe56190765","orcid":null,"display_name":"Krzysztof Maziarz","source":"manual","import_confidence":0.72},{"id":"d555cf92-dbb8-4cd6-b9cc-3a82d85183de","orcid":null,"display_name":"Noam Shazeer","source":"manual","import_confidence":0.72},{"id":"b7d93d8f-b3b6-4571-b881-819e8f1bae2d","orcid":null,"display_name":"Quoc Le","source":"manual","import_confidence":0.72}]}}