Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
118 Pith papers cite this work.
abstract
The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.
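The sparse gating described in the abstract can be sketched in a few lines. This is a minimal NumPy sketch, not the paper's full method: it omits the tunable noise term and the load-balancing losses, and all function names and shapes here are illustrative.

```python
import numpy as np

def top_k_gating(x, w_gate, k=2):
    """Simplified sparse gating: keep only the k largest gate logits,
    renormalize them with a softmax, and zero out every other expert.
    x: (d,) token representation; w_gate: (d, n_experts) gating weights."""
    logits = x @ w_gate                    # (n_experts,) gate logits
    top = np.argsort(logits)[-k:]          # indices of the k largest logits
    gates = np.zeros_like(logits)
    z = np.exp(logits[top] - logits[top].max())  # stable softmax over top-k
    gates[top] = z / z.sum()               # non-selected experts stay exactly 0
    return gates

def moe_forward(x, w_gate, experts, k=2):
    """Combine expert outputs weighted by the sparse gates. Experts whose
    gate is zero are never evaluated, which is where the compute savings
    relative to a dense mixture come from."""
    gates = top_k_gating(x, w_gate, k)
    out = None
    for i, g in enumerate(gates):
        if g > 0:
            y = g * experts[i](x)
            out = y if out is None else out + y
    return out
```

With, say, four linear experts, only the two selected by the gate run per token:

```python
rng = np.random.default_rng(0)
d, n = 8, 4
experts = [(lambda x, W=rng.standard_normal((d, d)): x @ W) for _ in range(n)]
y = moe_forward(rng.standard_normal(d), rng.standard_normal((d, n)), experts, k=2)
```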
representative citing papers
- ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
- GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
- Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.
- EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.
- Test-Time Speculation adapts draft models online via target-model verifications to sustain high acceptance lengths during long LLM generations.
- Dooly reduces LLM inference profiling costs by 56.4% via configuration-agnostic taint-based labeling and selective database reuse, delivering simulation accuracy within 5% MAPE for TTFT and 8% for TPOT across 12 models.
- SDG-MoE introduces learned signed interaction graphs and disagreement-gated deliberation among experts in MoE architectures, yielding 19.8% better validation perplexity than the strongest baseline.
- DTSemNet gives an exact, invertible neural-network encoding of hard oblique decision trees that supports direct gradient training for both classification and regression without probabilistic softening or quantized estimators.
- MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
- SplatWeaver dynamically allocates Gaussian primitives via cardinality experts and pixel-level routing guided by high-frequency cues for improved generalizable novel view synthesis.
- Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
- TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.
- MoLF routes updates between full fine-tuning and LoRA at the optimizer level to match or exceed the better of either static method, with an efficient LoRA-only variant outperforming prior adaptive approaches.
- CEA assembles per-token low-rank residual updates via dense affinities over hyper-adapter-generated components to improve all-in-one image restoration on spatially non-uniform degradations.
- A buffer-free MoE dispatch and combine method on Ascend hardware with pooled HBM cuts intermediate relay overhead via direct expert window access.
- Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.
- Adding temporal memory via LIF, precision-weighted gating, and anticipatory prediction to MoE routers recovers effective expert selection at distribution transitions, with ablation confirming a super-additive beta-ant interaction.
- RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.
- MASCing uses an LSTM surrogate and optimized steering masks to enable flexible, inference-time control over MoE expert routing for safety objectives, improving jailbreak defense and content generation success rates substantially across multiple models.
- DMEP prunes experts module-by-module in LoRA-MoE and removes load balancing after pruning, cutting trainable parameters 35-43% and raising throughput ~10% while matching or exceeding uniform baselines on reasoning tasks.
- Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x higher throughput.
- Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.
- LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.
citing papers explorer
- ArgBench: Benchmarking LLMs on Computational Argumentation Tasks
  ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
- Language Models are Few-Shot Learners
  GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
- Test-Time Speculation
  Test-Time Speculation adapts draft models online via target-model verifications to sustain high acceptance lengths during long LLM generations.
- Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation
  MoLF routes updates between full fine-tuning and LoRA at the optimizer level to match or exceed the better of either static method, with an efficient LoRA-only variant outperforming prior adaptive approaches.
- XPERT: Expert Knowledge Transfer for Effective Training of Language Models
  XPERT extracts and reuses cross-domain expert knowledge from pre-trained MoE LLMs via inference analysis and tensor decomposition to improve performance and convergence in downstream language model training.
- GEM: Graph-Enhanced Mixture-of-Experts with ReAct Agents for Dialogue State Tracking
  GEM achieves 65.19% joint goal accuracy on MultiWOZ 2.2 by routing between a graph neural network expert for dialogue structure and a T5 expert for sequences, plus ReAct agents for value generation, outperforming prior SOTA methods.
- MemRouter: Memory-as-Embedding Routing for Long-Term Conversational Agents
  A lightweight supervised router using frozen-LLM embeddings for memory admission decisions outperforms LLM-based memory managers in both F1 score and latency on the LoCoMo benchmark.
- Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding
  EVICT adaptively truncates draft trees in MoE speculative decoding by combining drafter signals with profiled costs to retain only cost-effective prefixes, delivering up to 2.35x speedup over autoregressive decoding.
- Mixture of Heterogeneous Grouped Experts for Language Modeling
  MoHGE achieves standard MoE performance with 20% fewer parameters and balanced GPU utilization via grouped heterogeneous experts, two-level routing, and specialized auxiliary losses.
- Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling
  X-GRAM applies data-aware dynamic token injection with hybrid hashing and local feature extraction to achieve up to 4.4 accuracy point gains over vanilla backbones and 3.2 over retrieval baselines at 0.73B-1.15B scales using 50% smaller tables.
- AlignCultura: Towards Culturally Aligned Large Language Models?
  AlignCultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.
- ST-MoE: Designing Stable and Transferable Sparse Expert Models
  ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model.
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
  GShard supplies automatic sharding and conditional computation support that enabled training a 600-billion-parameter multilingual translation model on thousands of TPUs with superior quality.
- Dense vs Sparse Pretraining at Tiny Scale: Active-Parameter vs Total-Parameter Matching
  At tiny scale, MoE transformers lower validation loss versus dense models when active parameters match but raise it when total stored parameters match.
- TIDE: Every Layer Knows the Token Beneath the Context
  TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.
- ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring
  ARGUS uses a Prosecutor-Defender-Umpire multi-agent setup plus RAG and chain-of-thought rewards to adapt ad policy enforcement to new regulations using minimal fresh labels.
- Attention Is All You Need
  The Transformer replaces recurrence and convolution with self-attention alone, achieving state-of-the-art machine translation quality at substantially lower training cost.
- Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
  EngGPT2MoE-16B-A3B matches or beats other Italian models on most international benchmarks but trails top international models such as GPT-5 nano and Qwen3-8B.
- A Survey on Large Language Models for Code Generation
  A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.
- Large Language Models: A Survey
  The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.