Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Andy Davis, Azalia Mirhoseini, Geoffrey Hinton, Jeff Dean, Krzysztof Maziarz, Noam Shazeer, Quoc Le · 2017 · cs.LG · arXiv 1701.06538

Canonical reference. Cited by 281 Pith papers; 75% of classified citations use this work as background.

abstract

The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.
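
The abstract's central mechanism, a trainable gate that picks a sparse combination of feed-forward experts per example, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the abstract: the top-k selection rule with a softmax over the kept scores, the one-hidden-layer ReLU experts, the random untrained weights, and all function names and sizes.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sparse_gate(x, gate_weights, k):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Assumption: top-k selection followed by a softmax over the kept scores.
    """
    scores = matvec(gate_weights, x)
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in top])
    return dict(zip(top, weights))


def expert_forward(x, expert):
    """One-hidden-layer feed-forward expert with ReLU (hypothetical shape)."""
    w_in, w_out = expert
    hidden = [max(0.0, h) for h in matvec(w_in, x)]
    return matvec(w_out, hidden)


def moe_forward(x, gate_weights, experts, k=2):
    """Weighted sum of the selected experts' outputs; other experts never run."""
    gates = sparse_gate(x, gate_weights, k)
    out = [0.0] * len(experts[0][1])
    for idx, g in gates.items():
        y = expert_forward(x, experts[idx])
        out = [o + g * v for o, v in zip(out, y)]
    return out, gates


def random_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_hidden, n_experts = 4, 8, 16
gate_weights = random_matrix(n_experts, d_model, rng)
experts = [
    (random_matrix(d_hidden, d_model, rng), random_matrix(d_model, d_hidden, rng))
    for _ in range(n_experts)
]
x = [0.5, -1.0, 0.25, 2.0]
y, gates = moe_forward(x, gate_weights, experts, k=2)
```

With 16 experts and k=2, only two expert networks are evaluated for this input, so parameter count grows with the number of experts while per-example computation stays roughly fixed. This is the trade-off the abstract describes.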

citation-role summary

background 39 · method 10 · baseline 2 · dataset 1
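
The 75% background share quoted for this hub appears to follow from these role counts. The short check below assumes the denominator is the sum of the classified citations listed here, not the full set of 281 citing papers.

```python
# Role counts as listed in the citation-role summary.
role_counts = {"background": 39, "method": 10, "baseline": 2, "dataset": 1}

# Assumption: shares are taken over classified citations only.
classified = sum(role_counts.values())
shares = {role: count / classified for role, count in role_counts.items()}

for role, share in shares.items():
    print(f"{role}: {share:.0%}")
```

Under that assumption there are 52 classified citations, and 39 of them (75%) are background.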

citation-polarity summary

background 39 · use method 9 · baseline 2 · support 1 · use dataset 1

representative citing papers

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

Language Models are Few-Shot Learners

cs.CL · 2020-05-28 · accept · novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

Agentic generation of verifiable rules for deterministic, self-expanding reaction classification

cs.AI · 2026-07-01 · unverdicted · novelty 7.0

Multi-agent LLMs generate and verify 14,073 deterministic reaction rules from 665,901 patents, enabling 97.7% classification of unseen reactions with finer resolution than fixed proprietary systems.

FRAME: Learning the Adaptation Domain with a Mixture of Fractional-Fourier Experts

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

FRAME adds a learnable fractional-Fourier order per expert in a MoE-LoRA setup so that low-rank updates are placed in the domain where they are most compact, yielding gains over fixed-domain baselines on LLaMA-3.1-8B and Qwen2.5-7B.

Language-Assisted Super-Resolution from Real-World Low-Resolution Patches

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

LA-SR redefines unpaired super-resolution in language space by projecting images into a semantically rich representation and applying vision-language model guided losses to handle real-world degradations extracted from depth variations.

When Does Synthetic CT Transfer? A Label-Free Donor/Host Diagnostic for Medical Vision-Language Model Routing on Real Lung CT

cs.CV · 2026-06-28 · unverdicted · novelty 7.0

Donor-driven nodule properties in synthetic CT transfer to real lung CT vision-language tasks while host-driven anatomy properties do not, enabling a label-free diagnostic for model routing.

PCCL: Process Group-Aware Scalable and Generic Collective Algorithm Synthesizer

cs.DC · 2026-06-05 · unverdicted · novelty 7.0

PCCL synthesizes near-optimal topology-aware collective algorithms for arbitrary patterns while being process group-aware and scalable to subsets of devices.

Less is MoE: Trimming Experts in Domain-Specialist Language Models

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

Fisher-MoE prunes sparse intermediate dimensions in MoE FFNs ranked by Fisher importance, delivering 50% compression that preserves capability while cutting memory ~45% and raising throughput 21%.

Argus-Retriever: Vision-LLM Late-Interaction Retrieval with Region-Aware Query-Conditioned MoE for Visual Document Retrieval

cs.IR · 2026-06-03 · unverdicted · novelty 7.0

Argus achieves the highest reported NDCG scores among open late-interaction models on ViDoRe V1 and combined V1+V2 by introducing query-dependent document representations via a region-aware MoE on Qwen3.5-VL, trained on 9% of public data with a 1024-dim head.

ViBE: Co-Optimizing Workload Skew and Hardware Variability for MoE Serving

cs.DC · 2026-05-30 · unverdicted · novelty 7.0

ViBE co-optimizes expert placement with measured GPU performance variability in MoE inference to cut execution-time imbalance, delivering 14% better SLO attainment and up to 45% lower P90 TTFT.

A Minimal Bifurcation Model of Load Imbalance in a Softmax Mixture-of-Experts Router

math.DS · 2026-05-27 · unverdicted · novelty 7.0

A mean-field limit of a reinforcement-based softmax router for two experts shows a supercritical pitchfork bifurcation, with an external asymmetry unfolding it into a cusp of fold bifurcations.

L2Rec: Towards Dual-View Understanding of LLMs for Personalized Recommendation

cs.IR · 2026-05-26 · unverdicted · novelty 7.0

L2Rec introduces dual-view personalized low-rank perturbations via DPMoE to let one LLM backbone produce complementary behavioral and semantic adaptations, with cross-view fusion, outperforming baselines on four datasets and in industrial A/B tests.

ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

ArchSIBench is a new benchmark dataset and evaluation suite that measures vision-language models on architectural spatial intelligence across 17 subtasks, showing most models lag human baselines especially in transformation and configuration.

Beyond Routing: Characterising Expert Tuning and Representation in Vision Mixture-of-Experts

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

Expert specialization in vision MoE models is dominated by a stable animate-inanimate distinction visible from gating to readout, with broader tuning to continuous visual and semantic dimensions rather than narrow categorical preferences.

Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

Residual Paving decomposes selective refusal editing into an early-layer router for intervention decisions and later-layer residual experts for edits, with oracle routing showing that learned route selectivity is the primary bottleneck across six backbones.

Dynamic Chunking for Diffusion Language Models

cs.CL · 2026-05-15 · unverdicted · novelty 7.0

DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.

MuteBench: Modality Unavailability Tolerance Evaluation for Incomplete Multimodal Fusion

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

MuteBench evaluates how robust multimodal fusion is to missing modalities and to missing data within a modality, using 125,000 samples from 9 clinical datasets, and finds that architecture family predicts tolerance better than parameter count.

Vector-Quantized Discrete Latent Factors Meet Financial Priors: Dynamic Cross-Sectional Stock Ranking Prediction for Portfolio Construction

cs.LG · 2026-05-13 · conditional · novelty 7.0

PRISM-VQ integrates vector-quantized latent factors with financial priors and a structure-conditioned mixture-of-experts to deliver improved cross-sectional stock return predictions and portfolio performance on CSI 300 and S&P 500.

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.

Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference

cs.DC · 2026-05-11 · unverdicted · novelty 7.0

EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.

SDG-MoE: Signed Debate Graph Mixture-of-Experts

cs.LG · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

SDG-MoE introduces learned signed interaction graphs and disagreement-gated deliberation among experts in MoE architectures, yielding 19.8% better validation perplexity than the strongest baseline.

Approximation-Free Differentiable Oblique Decision Trees

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

DTSemNet gives an exact, invertible neural-network encoding of hard oblique decision trees that supports direct gradient training for both classification and regression without probabilistic softening or quantized estimators.

citing papers explorer

Showing 2 of 2 citing papers after filters.

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

cs.CL · 2020-06-30 · unverdicted · none · ref 16 · internal anchor

GShard supplies automatic sharding and conditional computation support that enabled training a 600-billion-parameter multilingual translation model on thousands of TPUs with superior quality.

A Survey on Large Language Models for Code Generation

cs.CL · 2024-06-01 · unverdicted · none · ref 238 · internal anchor

A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.
