Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
121 Pith papers cite this work.
abstract
The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.
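The layer the abstract describes reduces to a few moving parts: a trainable gate scores every expert, keeps only the top-k scores (with learned, input-dependent noise added during training, per the paper's noisy top-k gating), and mixes the chosen experts' outputs under softmax weights. Below is a minimal PyTorch sketch under those assumptions; the two-layer experts, the dense dispatch loop, and k=4 are illustrative simplifications, and the paper's auxiliary load-balancing losses are omitted.

```python
# Minimal sketch of a Sparsely-Gated MoE layer with noisy top-k gating,
# following the abstract's description. Shapes and k are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparselyGatedMoE(nn.Module):
    def __init__(self, d_model, d_hidden, num_experts, k=4):
        super().__init__()
        self.k = k
        # Each expert is a simple feed-forward sub-network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        # Trainable gate: clean logits plus learned, input-dependent noise.
        self.w_gate = nn.Linear(d_model, num_experts, bias=False)
        self.w_noise = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x):  # x: (batch, d_model)
        logits = self.w_gate(x)
        if self.training:  # noisy top-k gating; noise aids load balancing
            logits = logits + torch.randn_like(logits) * F.softplus(self.w_noise(x))
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        gates = F.softmax(topk_vals, dim=-1)  # softmax over the k kept logits only
        out = torch.zeros_like(x)
        # Dense loop for clarity; real systems batch-dispatch tokens to experts.
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += gates[mask, slot, None] * expert(x[mask])
        return out
```

Because only k of the (possibly thousands of) experts run per example, computation grows with k while parameter count grows with the number of experts, which is how the abstract's 1000x capacity claim coexists with modest compute.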
authors
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean
co-cited works
fields
cs.LG 40, cs.CV 24, cs.CL 20, cs.AI 10, cs.DC 8, cs.RO 5, cs.CR 3, cs.IR 2, cond-mat.str-el 1, cs.AR 1
roles
background 1
polarities
background 1
representative citing papers
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
PRISM-VQ integrates vector-quantized latent factors with financial priors and a structure-conditioned mixture-of-experts to deliver improved cross-sectional stock return predictions and portfolio performance on CSI 300 and S&P 500.
Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.
EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.
Test-Time Speculation adapts draft models online via target-model verifications to sustain high acceptance lengths during long LLM generations.
Dooly reduces LLM inference profiling costs by 56.4% via configuration-agnostic taint-based labeling and selective database reuse, delivering simulation accuracy within 5% MAPE for TTFT and 8% for TPOT across 12 models.
SDG-MoE introduces learned signed interaction graphs and disagreement-gated deliberation among experts in MoE architectures, yielding 19.8% better validation perplexity than the strongest baseline.
DTSemNet gives an exact, invertible neural-network encoding of hard oblique decision trees that supports direct gradient training for both classification and regression without probabilistic softening or quantized estimators.
MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
SplatWeaver dynamically allocates Gaussian primitives via cardinality experts and pixel-level routing guided by high-frequency cues for improved generalizable novel view synthesis.
Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.
MoLF routes updates between full fine-tuning and LoRA at the optimizer level to match or exceed the better of either static method, with an efficient LoRA-only variant outperforming prior adaptive approaches.
CEA assembles per-token low-rank residual updates via dense affinities over hyper-adapter-generated components to improve all-in-one image restoration on spatially non-uniform degradations.
A buffer-free MoE dispatch and combine method on Ascend hardware with pooled HBM cuts intermediate relay overhead via direct expert window access.
Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.
Adding temporal memory via LIF, precision-weighted gating, and anticipatory prediction to MoE routers recovers effective expert selection at distribution transitions, with ablation confirming a super-additive beta-ant interaction.
RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.
MASCing uses an LSTM surrogate and optimized steering masks to enable flexible, inference-time control over MoE expert routing for safety objectives, improving jailbreak defense and content generation success rates substantially across multiple models.
DMEP prunes experts module-by-module in LoRA-MoE and removes load balancing after pruning, cutting trainable parameters 35-43% and raising throughput ~10% while matching or exceeding uniform baselines on reasoning tasks.
Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x higher throughput.
Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.
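The expert-upcycling summary above suggests a simple mechanical step worth sketching: duplicate the experts of a trained MoE layer, grow the gate to match, and resume pre-training. The helper below reuses the SparselyGatedMoE sketch from earlier; the copy count, weight jitter, and gate-row tiling are assumptions for illustration, not the cited paper's recipe.

```python
# Hedged sketch of the duplication step behind expert upcycling: copy each
# expert of a trained MoE layer, grow the gate to match, then keep training.
import copy
import torch
import torch.nn as nn

def upcycle_experts(moe, copies_per_expert=2, jitter=1e-3):
    grown = nn.ModuleList()
    for expert in moe.experts:
        for _ in range(copies_per_expert):
            clone = copy.deepcopy(expert)
            # Tiny perturbation so identical copies can diverge in training.
            with torch.no_grad():
                for p in clone.parameters():
                    p.add_(torch.randn_like(p) * jitter)
            grown.append(clone)
    moe.experts = grown
    # The gate must score the new expert count; tile the old rows so each
    # copy starts with its parent's routing behavior. (w_noise would be
    # grown the same way; omitted for brevity.)
    old = moe.w_gate.weight.data  # shape: (num_experts, d_model)
    moe.w_gate = nn.Linear(old.shape[1], len(grown), bias=False)
    moe.w_gate.weight.data.copy_(old.repeat_interleave(copies_per_expert, dim=0))
    return moe
```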
citing papers explorer
- Using large language models for embodied planning introduces systematic safety risks
  LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.
- Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality
  Routing topology in sparse Mixture-of-Experts models does not determine asymptotic language modeling perplexity; multiple variants including cosine-similarity routing achieve statistically equivalent performance.
- When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction
  Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.
- Domain Restriction via Multi SAE Layer Transitions
  Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.
- Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models
  MoR lets clients train local reward models on private preferences and uses a learned Mixture-of-Rewards with GRPO on the server to align a shared base VLM without exchanging parameters, architectures, or raw data.
- Geometric Routing Enables Causal Expert Control in Mixture of Experts
  Cosine-similarity routing in low-dimensional space makes MoE experts monosemantic by construction and enables direct causal control via centroid interventions (a minimal sketch of this routing scheme follows the list).
- STAR: Failure-Aware Markovian Routing for Multi-Agent Spatiotemporal Reasoning
  STAR is a failure-aware Markovian router that learns recovery transitions from both successful and unsuccessful execution traces to improve multi-agent performance on spatiotemporal benchmarks.
- Complexity Horizons of Compressed Models in Analog Circuit Analysis
  Prerequisite graphs map compressed LLM performance boundaries in analog circuit analysis to allow selecting the smallest viable model for a given task complexity.
- Enhancing Online Recruitment with Category-Aware MoE and LLM-based Data Augmentation
  LLM chain-of-thought rewriting of job postings plus category-aware MoE improves person-job fit AUC by 2.4%, GAUC by 7.5%, and live click-through conversion by 19.4%.
- Benchmarking PNW Model for MedMNIST to 100% Accuracy
  A new 'Artificial Special Intelligence' method is claimed to enable error-free training of classification models to 100% accuracy on 15 of 18 MedMNIST biomedical datasets.
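As promised above, here is a minimal sketch of the cosine-similarity routing that the Geometric Routing entry describes: tokens and expert centroids are compared on a low-dimensional unit sphere, so editing a centroid directly changes which inputs an expert captures. The d_route width, k, and the steer_centroid helper are hypothetical illustration choices, not that paper's code.

```python
# Minimal sketch of cosine-similarity routing with centroid interventions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineRouter(nn.Module):
    def __init__(self, d_model, num_experts, d_route=16, k=2):
        super().__init__()
        self.k = k
        self.proj = nn.Linear(d_model, d_route, bias=False)  # low-dim routing space
        self.centroids = nn.Parameter(torch.randn(num_experts, d_route))

    def forward(self, x):  # x: (batch, d_model) -> gate weights and expert ids
        z = F.normalize(self.proj(x), dim=-1)    # unit-norm token directions
        c = F.normalize(self.centroids, dim=-1)  # unit-norm expert centroids
        sims = z @ c.t()                         # cosine similarities in [-1, 1]
        vals, idx = sims.topk(self.k, dim=-1)
        return F.softmax(vals, dim=-1), idx

@torch.no_grad()
def steer_centroid(router, e, direction, alpha=1.0):
    # Causal intervention: pull expert e's centroid toward a concept
    # direction so tokens matching that direction route to expert e.
    router.centroids[e] += alpha * F.normalize(direction, dim=-1)
```

Because routing depends only on angles in the routing space, a centroid edit is a direct, interpretable intervention on expert assignment, which is the causal-control property that entry highlights.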