A mean-field limit of a reinforcement-based softmax router for two experts shows a supercritical pitchfork bifurcation, with an external asymmetry unfolding it into a cusp of fold bifurcations.
org/abs/2202.09368
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
Adding temporal memory via LIF, precision-weighted gating, and anticipatory prediction to MoE routers recovers effective expert selection at distribution transitions, with ablation confirming a super-additive beta-ant interaction.
Loss-Free Balancing keeps expert loads balanced in MoE models by dynamically adjusting routing-score biases based on recent usage, avoiding auxiliary-loss interference and yielding better performance.
Coverage-aware pruning using per-corpus utility profiles on WikiText2 and C4 improves zero-shot accuracy and reduces perplexity degradation in two MoE models at 25-75% retention compared to baselines, without downstream data.
Schedule-level shared-prefix reuse decouples prefix and suffix passes in GRPO training to compute shared prefixes once, delivering up to 4.395x speedup and 59.1% HBM reduction while preserving numerical equivalence.
Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.
AGoQ delivers up to 52% lower memory use and 1.34x faster training for 8B-32B LLaMA models by using near-4-bit adaptive activations and 8-bit gradients while preserving pretraining convergence and downstream accuracy.
MoT decouples non-embedding parameters by modality in transformers to match dense multi-modal performance with roughly one-third to one-half the FLOPs.