HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.
super hub Canonical reference
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Canonical reference. 75% of citing Pith papers cite this work as background.
abstract
The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational effic
authors
co-cited works
representative citing papers
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
Multi-agent LLMs generate and verify 14,073 deterministic reaction rules from 665,901 patents, enabling 97.7% classification of unseen reactions with finer resolution than fixed proprietary systems.
FRAME adds a learnable fractional-Fourier order per expert in a MoE-LoRA setup so that low-rank updates are placed in the domain where they are most compact, yielding gains over fixed-domain baselines on LLaMA-3.1-8B and Qwen2.5-7B.
Donor-driven nodule properties in synthetic CT transfer to real lung CT vision-language tasks while host-driven anatomy properties do not, enabling a label-free diagnostic for model routing.
ForeMoE uses routing foresight from the rollout stage to enable micro-step load balancing in MoE RL post-training via a hierarchical planner and transfer engine, claiming up to 1.45x speedup on 64 GPUs.
PCCL synthesizes near-optimal topology-aware collective algorithms for arbitrary patterns while being process group-aware and scalable to subsets of devices.
Fisher-MoE prunes sparse intermediate dimensions in MoE FFNs ranked by Fisher importance, delivering 50% compression that preserves capability while cutting memory ~45% and raising throughput 21%.
Argus achieves the highest reported NDCG scores among open late-interaction models on ViDoRe V1 and combined V1+V2 by introducing query-dependent document representations via a region-aware MoE on Qwen3.5-VL, trained on 9% of public data with a 1024-dim head.
ViBE co-optimizes expert placement with measured GPU performance variability in MoE inference to cut execution-time imbalance, delivering 14% better SLO attainment and up to 45% lower P90 TTFT.
A mean-field limit of a reinforcement-based softmax router for two experts shows a supercritical pitchfork bifurcation, with an external asymmetry unfolding it into a cusp of fold bifurcations.
L2Rec introduces dual-view personalized low-rank perturbations via DPMoE to let one LLM backbone produce complementary behavioral and semantic adaptations, with cross-view fusion, outperforming baselines on four datasets and in industrial A/B tests.
ArchSIBench is a new benchmark dataset and evaluation suite that measures vision-language models on architectural spatial intelligence across 17 subtasks, showing most models lag human baselines especially in transformation and configuration.
Expert specialization in vision MoE models is dominated by a stable animate-inanimate distinction visible from gating to readout, with broader tuning to continuous visual and semantic dimensions rather than narrow categorical preferences.
Residual Paving decomposes selective refusal editing into an early-layer router for intervention decisions and later-layer residual experts for edits, with oracle routing showing that learned route selectivity is the primary bottleneck across six backbones.
DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.
MuteBench evaluates multimodal fusion robustness to modality missing and within-modality missing on 125000 samples from 9 clinical datasets, finding architecture family predicts tolerance better than parameter count.
PRISM-VQ integrates vector-quantized latent factors with financial priors and a structure-conditioned mixture-of-experts to deliver improved cross-sectional stock return predictions and portfolio performance on CSI 300 and S&P 500.
Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.
EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.
SDG-MoE introduces learned signed interaction graphs and disagreement-gated deliberation among experts in MoE architectures, yielding 19.8% better validation perplexity than the strongest baseline.
DTSemNet gives an exact, invertible neural-network encoding of hard oblique decision trees that supports direct gradient training for both classification and regression without probabilistic softening or quantized estimators.
citing papers explorer
-
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
GShard supplies automatic sharding and conditional computation support that enabled training a 600-billion-parameter multilingual translation model on thousands of TPUs with superior quality.
-
Review Residuals: Update-Conditioned Residual Gating for Transformers
Review Residuals add an update-conditioned gate to transformer residual connections, yielding depth-stable training and performance gains that emerge and grow with model size from 590M parameters upward.
-
Clinically Structured Rank-Gated LoRA for Cross-Benchmark Medical Question Answering
BiRG-LoRA achieves 69.31% macro-average accuracy across CMB, CMExam, MedQA, and MedMCQA, outperforming MoELoRA by 0.89 points with 28.1% fewer trainable parameters under a matched Qwen3-8B protocol.
-
Mixture-of-Control: State-Aware Fine-Tuning for Transformer-based Models
Mixture-of-Control adaptively combines local and global control states in transformer fine-tuning by treating per-block states as experts in a sparse MoE setup to improve cross-block communication while keeping memory and compute costs comparable to prior state-based methods.
-
Language-Assisted Super-Resolution from Real-World Low-Resolution Patches
LA-SR extracts real LR patches from depth-varying regions in single images and uses vision-language models with linguistic content and quality losses for unpaired super-resolution.
-
Orthogonal Representation Editing: Decoupling Semantic Entanglement in Batch Knowledge Editing of LLMs
ORE decouples semantic entanglement in LLM hidden states via orthogonal edit vectors and a gated non-linear head, improving batch knowledge editing performance including cross-lingual cases.
-
Sakana Fugu Technical Report
Sakana Fugu trains LLM orchestrators using fine-tuning, evolutionary algorithms, and RL to build query-adaptive multi-agent scaffolds, claiming SOTA results on benchmarks including SWE-Bench Pro and GPQA-Diamond.
-
Factor-Aware Mixture-of-Experts with Pretrained Encoder for Combinatorial Generalization
FAME combines a factor-aware MoE with frozen pretrained encoders via staged adapter training and joint fine-tuning, reporting 34% gains on Meta-World and 35% in real-world pick-and-place under environmental changes.
-
Predicting Immune Biomarkers with MultiModal Mixture-of-Expert Pathology Foundation Models Empowers Precision Oncology
MixTIME uses a learnable-router MoE to fuse three pathology foundation models for pixel- and slide-level prediction of 17 mIF protein markers from H&E images, improving spatial domain ID, survival prediction, and pathologist-validated report generation.
-
Profy: Interpretable Visualization of Expertise-Dependent Motor Skills Toward Supporting Piano Practice
Profy uses take-level expert-amateur labels on 1083 piano recordings to produce time-aligned highlight scores that correlate with expert review points (r=0.61) on held-out amateur clips.
-
Loss-Guided Adaptive Scale Refinement for Molecular Force Prediction
Loss-guided adaptive scale refinement on NaCl aqueous system reduces overall force MAE from 399.65 to 381.23 by discovering intermediate scales from initial anchors.
-
Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning
SETA decomposes parameters into task-specific and shared sparse experts with adaptive anchoring and routing regularization to improve retention and backward transfer in LLM continual learning.
-
HANDOFF: Humanoid Agentic Task-Space Whole-Body Control via Distilled Complementary Teachers
HANDOFF is a distilled mixture-of-experts humanoid whole-body controller that follows a compact task-space interface, matches SOTA velocity tracking, provides large manipulation workspace on Unitree G1, and supports VLM-driven agentic planning with no task-specific data.
-
Analyzing Stream Collapse in Hyper-Connections: From Diagnosis to Mitigation
Hyper-Connections models show stream collapse to a dominant stream with near-identity residual mixing after seeding; symmetry-breaking initialization mitigates dominance and raises performance.
-
MASER: Modality-Adaptive Specialist Routing for Embodied 3D Spatial Intelligence
MASER trains modality adapters on a shared VLM and routes questions via a learned MLP to the best adapter, reaching 51.3% oracle agreement on Open3D-VQA while using one adapter call per question.
-
Gravity-Aware Hierarchical Routing for Lightweight SensorLLM on Human Activity Recognition
Introduces a lightweight gravity-aware routing head that improves macro-F1 on static classes in compressed SensorLLM for human activity recognition on the MHealth dataset.
-
When Meaning Travels: A Granular Lens on Hybrid-MoE's Role in Idiomatic Understanding for Language Models
HybridMoE with controlled hybridization and idiomatic property signals yields 5-6% gains in figurative language representation for multilingual vision-language models.
-
DAG-MoE: From Simple Mixture to Structural Aggregation in Mixture-of-Experts
DAG-MoE uses a lightweight module to learn DAG-based structural aggregation of selected experts, expanding combination space and enabling intra-layer multi-step reasoning compared to standard weighted-sum MoE.
-
GNMR: Runtime Stability Control for Low-Precision Large Language Model Training
GNMR is a gradient-norm-based controller that maps local stability signals to budgeted recovery actions to stabilize low-precision LLM training while preserving quality.
-
Can BEV Perception Gracefully Degrade under Sensor Failures?
Grace-BEV enables graceful degradation in BEV perception under sensor failures by using a TrustGate Router for modality trustworthiness and FailSafe Fusion Block for dynamic integration, with modality dropout training, restoring performance to 34.7% mAP under LiDAR failure on nuScenes variants.
-
Graph-Conditioned Mixture of Graph Neural Network Experts for Traffic Forecasting
GC-MoE improves MAE on four traffic forecasting benchmarks by routing nodes to combinations of frozen spatio-temporal GNN experts via a graph-conditioned lightweight router, training only ~17K parameters atop 1.5M frozen weights.
-
Dense2MoE: Pushing the Pareto Frontier of On-Device LLMs via Unified Pruning and Upcycling
Dense2MoE unifies pruning of attention modules with upcycling of MLPs into MoE experts to produce on-device LLMs that improve the latency-accuracy Pareto frontier.
-
BioFact-MoE: Biologically Factorized Mixture of Experts for Vision-Language Prognostic Modeling in Hepatocellular Carcinoma
BioFact-MoE applies a biologically factorized MoE architecture to multimodal MRI-report data and reports improved 12-24 month survival AUCs plus selective embedding associations in an N=588 HCC cohort.
-
Safety-Oriented Routing Analysis of Mixtral MoE Under Benign and Harmful Prompts
Empirical routing analysis of Mixtral shows safety-relevant signals are distributed and depth-dependent rather than localized to fixed experts.
-
Asymmetric Scaling Laws from Sparse Features
A sparse-activation model predicts double-descent loss with distinct under- and over-parameterized scaling exponents set by sparsity, plus a compute-optimal frontier favoring dataset growth.
-
EasyVFX: Frequency-Driven Decoupling for Resource-Efficient VFX Generation
EasyVFX decouples VFX generation via frequency-aware Mixture-of-Experts and test-time training to achieve realistic effects with limited resources.
-
Memory-Induced Supra-Competitive Outcomes Between Deep Reinforcement Learning Agents in Optimal Trade Execution
In a two-agent Almgren-Chriss liquidation game, deep RL agents given intra-episode history of prices and own actions achieve supra-competitive outcomes more frequently and persistently than agents without such memory.
-
Heterogeneity-Aware Dataset Scheduling for Efficient Audio Large Language Model Training
GST uses gradient-based affinity metrics to form dataset groups and applies progressive scheduling, achieving 30-40% faster convergence than uniform mixture training on 14 AudioQA datasets while matching or exceeding performance.
-
Guard: Scalable Straggler Detection and Node Health Management for Large-Scale Training
Guard combines online performance monitoring and offline node qualification to detect stragglers and fail-slow behaviors in large-scale training, reporting up to 1.7x higher mean FLOPs utilization and reduction of step-time variance from 20% to 1%.
-
DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization
DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.
-
NGM: A Plug-and-Play Training-Free Memory Module for LLMs
NGM is a plug-and-play n-gram memory module that encodes n-grams from pretrained embeddings and gates their injection to improve LLM performance by 0.5-1.2 points on average across eight benchmarks.
-
Dense vs Sparse Pretraining at Tiny Scale: Active-Parameter vs Total-Parameter Matching
At tiny scale, MoE transformers lower validation loss versus dense models when active parameters match but raise it when total stored parameters match.
-
Probing Routing-Conditional Calibration in Attention-Residual Transformers
Routing summaries and auxiliary features do not provide stable evidence of conditional miscalibration in AR transformers once confidence-matched baselines, capacity controls, and permutation nulls are applied.
-
Improving Generalization by Permutation Routing Across Model Copies
Replicating models and routing their local losses via permutations from a mixing kernel Q enables structured message sharing that improves generalization.
-
Sparse Layers are Critical to Scaling Looped Language Models
Looped-MoE models scale better than dense looped or standard transformers because routing changes across loops, and they enable stronger compute-quality trade-offs via early exits at loop boundaries.
-
Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
EngGPT2MoE-16B-A3B matches or exceeds other Italian open-source LLMs on most international benchmarks while remaining competitive on ITALIC, though it trails some top international models.
-
PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents
PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing, recommendation, and protein tasks.
-
TIDE: Every Layer Knows the Token Beneath the Context
TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.
-
Complexity Horizons of Compressed Models in Analog Circuit Analysis
Prerequisite graphs map compressed LLM performance boundaries in analog circuit analysis to allow selecting the smallest viable model for a given task complexity.
-
ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring
ARGUS uses a Prosecutor-Defender-Umpire multi-agent setup plus RAG and chain-of-thought rewards to adapt ad policy enforcement to new regulations using minimal fresh labels.
-
DeepPropNet: an operator learning-based predictor for thermal plasma properties
DeepPropNet predicts thermal plasma properties with relative L2 errors of 10^{-3} to 10^{-2} for SF6-N2 and C4F7N-CO2-O2 mixtures using single-property and mixture-of-experts architectures trained on high-fidelity data.
-
FaaSMoE: A Serverless Framework for Multi-Tenant Mixture-of-Experts Serving
FaaSMoE treats MoE experts as on-demand FaaS functions with configurable granularity, using under one-third the resources of a full-model baseline under multi-tenant workloads.
-
Structure-Semantic Decoupled Modulation of Global Geospatial Embeddings for High-Resolution Remote Sensing Mapping
SSDM decouples global geospatial embeddings into structural modulation and semantic injection pathways to improve accuracy and consistency in high-resolution remote sensing land cover mapping.
-
Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling
Nexusformer uses a three-stage nonlinear mapping in attention to enable stable, inheritable scaling of transformers, matching baseline perplexity with up to 41.5% less compute when growing from 240M to 440M parameters.
-
STK-Adapter: Incorporating Evolving Graph and Event Chain for Temporal Knowledge Graph Extrapolation
STK-Adapter adds Spatial-Temporal MoE, Event-Aware MoE, and Cross-Modality Alignment MoE to integrate evolving TKG graphs and event chains into LLMs, reducing information loss and improving extrapolation performance over prior methods.
-
Domain-Specialized Object Detection via Model-Level Mixtures of Experts
Model-level MoE of domain-specialized YOLO detectors with gating network outperforms standard ensembles on BDD100K while revealing expert specialization.
-
WILD-SAM: Phase-Aware Expert Adaptation of SAM for Landslide Detection in Wrapped InSAR Interferograms
WILD-SAM is a fine-tuned SAM variant using phase-aware MoE adapters and wavelet subband enhancement that achieves state-of-the-art landslide detection on wrapped InSAR data.
-
SubFlow: Sub-mode Conditioned Flow Matching for Diverse One-Step Generation
SubFlow restores full mode coverage in one-step flow matching by conditioning on sub-modes from semantic clustering, yielding higher diversity on ImageNet-256 while preserving FID.
-
Sustainability Is Not Linear: Quantifying Performance, Energy, and Privacy Trade-offs in On-Device Intelligence
Empirical case study on a flagship Android device profiles energy, latency, and quality trade-offs across eight LLMs, revealing a quantization energy paradox and identifying mid-sized models as practical sweet spots.
-
Evaluating Small Language Models for Front-Door Routing: A Harmonized Benchmark and Synthetic-Traffic Experiment
Qwen-2.5-3B achieves 0.793 accuracy and 988 ms median latency on six-class task routing but misses the pre-registered viability bar of 0.85 accuracy and 2000 ms P95 latency.