org/abs/2202.09368

Mixture-of-Experts with Expert Choice Routing · 2022 · arXiv 2202.09368

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

A Minimal Bifurcation Model of Load Imbalance in a Softmax Mixture-of-Experts Router

math.DS · 2026-05-27 · unverdicted · novelty 7.0

A mean-field limit of a reinforcement-based softmax router for two experts shows a supercritical pitchfork bifurcation, with an external asymmetry unfolding it into a cusp of fold bifurcations.

Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts

cs.LG · 2026-05-01 · conditional · novelty 7.0

Adding temporal memory via LIF, precision-weighted gating, and anticipatory prediction to MoE routers recovers effective expert selection at distribution transitions, with ablation confirming a super-additive beta-ant interaction.

Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts

cs.LG · 2024-08-28 · conditional · novelty 7.0

Loss-Free Balancing keeps expert loads balanced in MoE models by dynamically adjusting routing-score biases based on recent usage, avoiding auxiliary-loss interference and yielding better performance.

Generic Expert Coverage for Pruning Sparse Mixture-of-Experts Language Models

cs.AI · 2026-07-02 · unverdicted · novelty 6.0

Coverage-aware pruning using per-corpus utility profiles on WikiText2 and C4 improves zero-shot accuracy and reduces perplexity degradation in two MoE models at 25-75% retention compared to baselines, without downstream data.

Schedule-Level Shared-Prefix Reuse for LLM RL Training

cs.DC · 2026-05-31 · unverdicted · novelty 6.0

Schedule-level shared-prefix reuse decouples prefix and suffix passes in GRPO training to compute shared prefixes once, delivering up to 4.395x speedup and 59.1% HBM reduction while preserving numerical equivalence.

Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism

cs.DC · 2026-05-06 · unverdicted · novelty 6.0

Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.

AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs

cs.CL · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

AGoQ delivers up to 52% lower memory use and 1.34x faster training for 8B-32B LLaMA models by using near-4-bit adaptive activations and 8-bit gradients while preserving pretraining convergence and downstream accuracy.

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

cs.CL · 2024-11-07 · conditional · novelty 6.0

MoT decouples non-embedding parameters by modality in transformers to match dense multi-modal performance with roughly one-third to one-half the FLOPs.

Language-Assisted Super-Resolution from Real-World Low-Resolution Patches

cs.CV · 2026-06-30 · unreviewed · ref 47
