HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.
super hub Canonical reference
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Canonical reference. 75% of citing Pith papers cite this work as background.
abstract
The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational effic
authors
co-cited works
representative citing papers
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
Multi-agent LLMs generate and verify 14,073 deterministic reaction rules from 665,901 patents, enabling 97.7% classification of unseen reactions with finer resolution than fixed proprietary systems.
FRAME adds a learnable fractional-Fourier order per expert in a MoE-LoRA setup so that low-rank updates are placed in the domain where they are most compact, yielding gains over fixed-domain baselines on LLaMA-3.1-8B and Qwen2.5-7B.
Donor-driven nodule properties in synthetic CT transfer to real lung CT vision-language tasks while host-driven anatomy properties do not, enabling a label-free diagnostic for model routing.
SharpMoE is a plug-and-play post-training method that uses clean latent features and a trajectory routing loss to enable accurate saliency-based routing in diffusion MoE models for improved visual generation.
GeMoE adaptively sets the number of experts per token via gating entropy, retaining 99.5% of static-routing performance while raising average sparsity by 36.5%.
ForeMoE uses routing foresight from the rollout stage to enable micro-step load balancing in MoE RL post-training via a hierarchical planner and transfer engine, claiming up to 1.45x speedup on 64 GPUs.
PCCL synthesizes near-optimal topology-aware collective algorithms for arbitrary patterns while being process group-aware and scalable to subsets of devices.
Fisher-MoE prunes sparse intermediate dimensions in MoE FFNs ranked by Fisher importance, delivering 50% compression that preserves capability while cutting memory ~45% and raising throughput 21%.
Argus achieves the highest reported NDCG scores among open late-interaction models on ViDoRe V1 and combined V1+V2 by introducing query-dependent document representations via a region-aware MoE on Qwen3.5-VL, trained on 9% of public data with a 1024-dim head.
ViBE co-optimizes expert placement with measured GPU performance variability in MoE inference to cut execution-time imbalance, delivering 14% better SLO attainment and up to 45% lower P90 TTFT.
A mean-field limit of a reinforcement-based softmax router for two experts shows a supercritical pitchfork bifurcation, with an external asymmetry unfolding it into a cusp of fold bifurcations.
L2Rec introduces dual-view personalized low-rank perturbations via DPMoE to let one LLM backbone produce complementary behavioral and semantic adaptations, with cross-view fusion, outperforming baselines on four datasets and in industrial A/B tests.
ArchSIBench is a new benchmark dataset and evaluation suite that measures vision-language models on architectural spatial intelligence across 17 subtasks, showing most models lag human baselines especially in transformation and configuration.
Expert specialization in vision MoE models is dominated by a stable animate-inanimate distinction visible from gating to readout, with broader tuning to continuous visual and semantic dimensions rather than narrow categorical preferences.
Residual Paving decomposes selective refusal editing into an early-layer router for intervention decisions and later-layer residual experts for edits, with oracle routing showing that learned route selectivity is the primary bottleneck across six backbones.
DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.
MuteBench evaluates multimodal fusion robustness to modality missing and within-modality missing on 125000 samples from 9 clinical datasets, finding architecture family predicts tolerance better than parameter count.
PRISM-VQ integrates vector-quantized latent factors with financial priors and a structure-conditioned mixture-of-experts to deliver improved cross-sectional stock return predictions and portfolio performance on CSI 300 and S&P 500.
Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.
EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.
citing papers explorer
-
Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling
Nexusformer uses a three-stage nonlinear mapping in attention to enable stable, inheritable scaling of transformers, matching baseline perplexity with up to 41.5% less compute when growing from 240M to 440M parameters.
-
STK-Adapter: Incorporating Evolving Graph and Event Chain for Temporal Knowledge Graph Extrapolation
STK-Adapter adds Spatial-Temporal MoE, Event-Aware MoE, and Cross-Modality Alignment MoE to integrate evolving TKG graphs and event chains into LLMs, reducing information loss and improving extrapolation performance over prior methods.
-
Domain-Specialized Object Detection via Model-Level Mixtures of Experts
Model-level MoE of domain-specialized YOLO detectors with gating network outperforms standard ensembles on BDD100K while revealing expert specialization.
-
WILD-SAM: Phase-Aware Expert Adaptation of SAM for Landslide Detection in Wrapped InSAR Interferograms
WILD-SAM is a fine-tuned SAM variant using phase-aware MoE adapters and wavelet subband enhancement that achieves state-of-the-art landslide detection on wrapped InSAR data.
-
SubFlow: Sub-mode Conditioned Flow Matching for Diverse One-Step Generation
SubFlow restores full mode coverage in one-step flow matching by conditioning on sub-modes from semantic clustering, yielding higher diversity on ImageNet-256 while preserving FID.
-
Sustainability Is Not Linear: Quantifying Performance, Energy, and Privacy Trade-offs in On-Device Intelligence
Empirical case study on a flagship Android device profiles energy, latency, and quality trade-offs across eight LLMs, revealing a quantization energy paradox and identifying mid-sized models as practical sweet spots.
-
Evaluating Small Language Models for Front-Door Routing: A Harmonized Benchmark and Synthetic-Traffic Experiment
Qwen-2.5-3B achieves 0.793 accuracy and 988 ms median latency on six-class task routing but misses the pre-registered viability bar of 0.85 accuracy and 2000 ms P95 latency.
-
Beyond Sunk Costs: Boosting LLM Pre-training Efficiency via Orthogonal Growth of Mixture-of-Experts
Orthogonal growth recycles pre-trained MoE checkpoints via layer copying and noisy expert duplication, delivering 10.6% higher accuracy than training from scratch with equivalent extra compute.
-
Joint Learning using Mixture-of-Expert-Based Representation for Speech Enhancement and Robust Emotion Recognition
Sparse MERIT uses frame-wise sparse mixture-of-experts with task-specific gating on self-supervised speech features to jointly optimize enhancement and emotion recognition, reporting gains over baselines on MSP-Podcast at low SNR.
-
PiKV: KV Cache Management System for Mixture of Experts
PiKV proposes expert-sharded KV storage, PiKV routing, adaptive scheduling, and compression modules to reduce overhead in multi-GPU MoE inference.
-
Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate
Demonstrates that Transformers can continue learning when grown modularly above a frozen minimal token interface under a fixed active-parameter budget, with reported viability in 9-layer and 16-layer experiments.
-
Test-Time Alignment via Hypothesis Reweighting
HyRe personalizes reward models at test time by reweighting an ensemble of heads trained on aggregate preferences, using few target examples to outperform uniform averaging and prior methods on RewardBench and 32 tasks.
-
Mixtral of Experts
Mixtral 8x7B is a sparse MoE LLM activating 2 of 8 experts per layer that matches or exceeds Llama 2 70B and GPT-3.5 on benchmarks while using only 13B active parameters.
-
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
Trained the largest monolithic 530B-parameter transformer language model to date and reported new state-of-the-art zero- and few-shot results on multiple NLP benchmarks.
-
4K-Memristor Analog-Grade Passive Crossbar Circuit
Experimental demonstration of a 4K-device passive analog memristor crossbar with high yield and sufficient precision for neuromorphic pattern classification.
-
Attention Is All You Need
Pith review generated a malformed one-line summary.
-
A Multi-task Mixture of Experts Framework for Malware Classification, Packing Detection, and Family Attribution
A Multi-Gate MoE architecture achieves 0.9744 combined detection rate across three malware tasks with improved robustness to mutations compared to other MoE variants.
-
Does Role Specialization Matter for Explanation Faithfulness in Mixture-of-Experts?
Representation decorrelation regularization in MoE models improves explanation faithfulness on multimodal benchmarks while preserving task performance.
-
CMSL: Constructive Multi-Sequence Learning for Recommendation Systems
CMSL uses a learnable module to disentangle user history into multiple pure sequences modeled with linear attention to improve recommendation performance over single-sequence approaches.
-
Beyond Feedforward Networks: Reentry Neural Systems as the Fundamental Basis of Subjecthood and Intrinsic Safety of Next-Generation AGI
A cycle-based reentry architecture is proposed to guarantee self-model emergence, self-preservation, and prompt-injection immunity in AGI via a D-I loop and a new S-measure of integrated information.
-
LiMoDE: Rethinking Lifelong Robot Manipulation from a Mixture-of-Dynamic-Experts Perspective
LiMoDE uses dynamic MoE pre-training on motion cues followed by lifelong expert addition for continuous robot task adaptation.
-
Does Mixture-of-Experts Actually Help Inference on Consumer and Edge Hardware? An Empirical Study
Empirical benchmarks show MoE inference cost on edge hardware tracks total parameters rather than active parameters, with OLMoE-1B-7B behind dense baselines especially on the Jetson device.
-
Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale
Technical report announcing Ling-2.6 and Ring-2.6 models with hybrid linear attention, evolutionary CoT, and KPop RL for efficient agentic intelligence at scale.
-
Reversible Foundations: Training a 120B Sparse MoE through State-Preserving Scaling
A 120B sparse MoE model with 460 experts was trained on one 8-GPU node to loss 1.78 using reversible recurrence and state-preserving scaling from a 1.78B dense seed, with 5.93B active parameters.
-
Ordering Matters: Rank-Aware Selective Fusion for Blended Emotion Recognition
Rank-aware selective fusion via attention-based gating and decoupled presence/salience heads with unsupervised domain adaptation outperforms baselines and ranks 2nd on the BlEmoRE challenge for blended emotion recognition.
-
Tracing the ongoing emergence of human-like reasoning in Large Language Models
LLMs function as accurate semantic processors for conditionals but do not replicate the pragmatic inferences that define human reasoning.
-
GoodServe: Towards High-Goodput Serving of Agentic LLM Inferences over Heterogeneous Resources
GoodServe proposes a predict-and-rectify routing system for agentic LLM inferences on heterogeneous GPUs that improves goodput by up to 27.4%.
-
Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.
-
Position: LLM Serving Needs Mathematical Optimization and Algorithmic Foundations, Not Just Heuristics
LLM serving requires mathematical optimization and algorithms with provable guarantees rather than generic heuristics that fail unpredictably on LLM workloads.
-
CoRE: Concept-Reasoning Expansion for Continual Brain Lesion Segmentation
CoRE aligns image tokens to a hierarchical concept library to simulate clinical reasoning for expert routing and demand-based growth in continual brain lesion segmentation, achieving SOTA on 12 tasks.
-
Enhancing Online Recruitment with Category-Aware MoE and LLM-based Data Augmentation
LLM chain-of-thought rewriting of job postings plus category-aware MoE improves person-job fit AUC by 2.4%, GAUC by 7.5%, and live click-through conversion by 19.4%.
-
Quadruped Parkour Learning: Sparsely Gated Mixture of Experts with Visual Input
Sparsely gated MoE policies double the success rate of a real Unitree Go2 quadruped on large-obstacle parkour versus matched-active-parameter MLP baselines while cutting inference time compared with a scaled-up MLP.
-
A Two-Stage Multi-Modal MRI Framework for Lifespan Brain Age Prediction
A two-stage architecture processes multi-modal MRI data independently before late fusion to classify developmental stages and predict brain age within stages for lifespan assessment.
-
Efficient Handwriting-Based Alzheimer,s Disease Diagnosis Using a Low-Rank Mixture of Experts Deep Learning Framework
A low-rank mixture of experts model trained on handwriting data delivers strong Alzheimer's diagnosis performance with substantially reduced parameter activation during inference.
-
Evaluating Cross-Architecture Performance Modeling of Distributed ML Workloads Using StableHLO
StableHLO serves as a viable unified representation for cross-architecture performance modeling of distributed ML workloads, preserving relative trends while exposing fidelity trade-offs.
-
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Gemini 2.5 Pro and Flash models are presented as achieving frontier performance in reasoning, coding, and long-context multimodal tasks while spanning a cost-capability Pareto curve.
-
GRaD-Nav++: Vision-Language Model Enabled Visual Drone Navigation with Gaussian Radiance Fields and Differentiable Dynamics
GRaD-Nav++ combines 3D Gaussian Splatting simulation and differentiable RL to train an onboard VLA policy that achieves 50-83% success on language-guided drone navigation tasks in simulation and real hardware.
-
A 35B Hybrid-Attention Mixture-of-Experts Model on a 6GB 2011 GPU: Hand-Written 4-bit CUDA Inference for Fermi
Engineers implement hybrid CPU-GPU 4-bit inference for a 35B hybrid-attention MoE on a 2011 Fermi GPU, reporting 34% lower prefill latency and 3x decode throughput via custom kernels and expert pinning.
-
Hybrid Compression: Integrating Pruning and Quantization for Optimized Neural Networks
Hybrid method applies pruning and quantization followed by MoE routing of compressed CNN experts to achieve large reductions in FLOPs and parameters with negligible accuracy loss on benchmarks.
-
Logit Distillation on Manifolds: Mapping by Learning
Presents a layer- and point-wise projection mapping for manifold-based logit distillation combined with LoRA to enable low-parameter student training with reported WER gains.
-
MMoA: An AI-Agent framework with recurrence for Memoried Mixure-of-Agent
MMoA adds LSTM recurrence to Mixture-of-Agents routing, reaching 58.0% win rate on AlpacaEval 2.0 versus 59.8% for baseline MoA while cutting runtime by up to 4.6%.
-
Deep Learning for Electricity Price Forecasting: A Review of Day-Ahead, Intraday, and Balancing Electricity Markets
A structured review organizes deep learning models for electricity price forecasting via a backbone-head-loss taxonomy and identifies gaps in intraday and balancing market applications.
-
LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems
A survey taxonomy of LLMs identifies three scaling crises and six efficiency paradigms while tracing the shift from generation to tool-using agents.
-
A Survey on Large Language Models for Code Generation
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.
-
Large Language Models: A Survey
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
-
The Hitchhiker's Guide to Agentic AI: From Foundations to Systems
A comprehensive reference book organizing existing techniques for agentic AI systems across LLM substrate, reasoning, agent design patterns, inter-agent coordination, and production deployment.
-
Benchmarking PNW Model for MedMNIST to 100% Accuracy
A new 'Artificial Special Intelligence' method is claimed to enable error-free training of classification models to 100% accuracy on 15 of 18 MedMNIST biomedical datasets.
-
Opportunities and Challenges of Large Language Models for Low-Resource Languages in Humanities Research
This survey paper identifies opportunities for LLMs in low-resource language humanities research along with challenges in data accessibility, model adaptability, and cultural sensitivity.
-
A Comprehensive Overview of Large Language Models
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.
- Hyperloop Transformers