HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Canonical reference. 75% of citing Pith papers cite this work as background.
abstract
The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.
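The gating idea in the abstract (a trainable gating network picks a sparse combination of experts per example) can be illustrated with a minimal sketch. It assumes plain top-k softmax gating with a linear gate and omits the noise term and load-balancing losses discussed in the paper. The names `top_k_gate`, `make_expert`, and `moe_layer` are hypothetical and chosen only for this illustration, and the experts are toy single-layer maps rather than the paper's feed-forward sub-networks.

```python
import math
import random


def top_k_gate(logits, k):
    """Softmax over the k largest logits; every other expert gets weight 0."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    gates = [0.0] * len(logits)
    for i in top:
        gates[i] = exps[i] / z
    return gates


def make_expert(dim, rng):
    """Hypothetical toy expert: one random linear map followed by ReLU."""
    w = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]

    def expert(x):
        return [max(0.0, sum(wij * xj for wij, xj in zip(row, x))) for row in w]

    return expert


def moe_layer(x, gate_weights, experts, k=2):
    """Combine expert outputs using sparse gate weights.

    gate_weights holds one row per expert (an assumed linear gate).
    Experts with zero gate weight are never evaluated, which is where
    the compute saving of conditional computation comes from.
    """
    logits = [sum(w * xj for w, xj in zip(row, x)) for row in gate_weights]
    gates = top_k_gate(logits, k)
    out = [0.0] * len(x)
    for g, expert in zip(gates, experts):
        if g > 0.0:
            y = expert(x)
            out = [o + g * yi for o, yi in zip(out, y)]
    return out, gates


if __name__ == "__main__":
    rng = random.Random(0)
    dim, n_experts = 4, 8
    experts = [make_expert(dim, rng) for _ in range(n_experts)]
    gate_weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_experts)]
    out, gates = moe_layer([1.0, -0.5, 0.25, 2.0], gate_weights, experts, k=2)
    print("active experts:", [i for i, g in enumerate(gates) if g > 0.0])
```

With `k=2` and eight experts, only two expert functions run per input, so parameter count grows with the number of experts while per-example compute stays roughly fixed.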
representative citing papers
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
Donor-driven nodule properties in synthetic CT transfer to real lung CT vision-language tasks while host-driven anatomy properties do not, enabling a label-free diagnostic for model routing.
A mean-field limit of a reinforcement-based softmax router for two experts shows a supercritical pitchfork bifurcation, with an external asymmetry unfolding it into a cusp of fold bifurcations.
L2Rec introduces dual-view personalized low-rank perturbations via DPMoE to let one LLM backbone produce complementary behavioral and semantic adaptations, with cross-view fusion, outperforming baselines on four datasets and in industrial A/B tests.
ArchSIBench is a new benchmark dataset and evaluation suite that measures vision-language models on architectural spatial intelligence across 17 subtasks, showing most models lag human baselines especially in transformation and configuration.
Expert specialization in vision MoE models is dominated by a stable animate-inanimate distinction visible from gating to readout, with broader tuning to continuous visual and semantic dimensions rather than narrow categorical preferences.
Residual Paving decomposes selective refusal editing into an early-layer router for intervention decisions and later-layer residual experts for edits, with oracle routing showing that learned route selectivity is the primary bottleneck across six backbones.
DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.
MuteBench evaluates multimodal fusion robustness to modality missing and within-modality missing on 125000 samples from 9 clinical datasets, finding architecture family predicts tolerance better than parameter count.
PRISM-VQ integrates vector-quantized latent factors with financial priors and a structure-conditioned mixture-of-experts to deliver improved cross-sectional stock return predictions and portfolio performance on CSI 300 and S&P 500.
Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.
EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.
SDG-MoE introduces learned signed interaction graphs and disagreement-gated deliberation among experts in MoE architectures, yielding 19.8% better validation perplexity than the strongest baseline.
DTSemNet gives an exact, invertible neural-network encoding of hard oblique decision trees that supports direct gradient training for both classification and regression without probabilistic softening or quantized estimators.
MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
StrLoRA is a regularized two-stage expert routing method for streaming CVIT that selects experts via textual instructions and applies token-wise cross-modal weighting with historical routing alignment.
SplatWeaver uses cardinality Gaussian experts and pixel-level routing to dynamically allocate varying numbers of Gaussian primitives for generalizable novel view synthesis.
Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
MoE experts in pretrained Transformers exhibit functional decorrelation with near-zero Jacobian alignment yet occupy partially overlapping representation subspaces, with routing sparsity modulating the geometry.
TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.
CEA assembles per-token low-rank residual updates via dense affinities over hyper-adapter-generated components to improve all-in-one image restoration on spatially non-uniform degradations.
citing papers explorer
- Using large language models for embodied planning introduces systematic safety risks
LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.