pith. machine review for the scientific record.

arxiv: 2604.14419 · v1 · submitted 2026-04-15 · 💻 cs.AI

Recognition: unknown

Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality

Ivan Ternovtsii, Yurii Bilak

Pith reviewed 2026-05-10 12:58 UTC · model grok-4.3

classification 💻 cs.AI
keywords mixture of experts · routing topology · equifinality · language modeling · perplexity · cosine similarity · sparse models · parameter efficiency

The pith

Routing topology does not determine asymptotic perplexity in Mixture-of-Experts language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the design of token routing to experts in sparse Mixture-of-Experts architectures controls final language modeling performance. By running 62 matched experiments on WikiText-103 at the 76-84M scale, it shows that five distinct cosine-similarity routing variants produce statistically equivalent perplexity scores. The result holds after training to convergence and replicates on OpenWebText; even hash, random, and top-1 routing trail the learned variants by only small margins. A linear router with far more parameters performs better, yet an iso-parameter cosine version recovers most of the advantage. This equifinality points to convergent redundancy in routing updates rather than topology-specific gains.

Core claim

In a geometric ST-MoE using cosine-similarity routing to learned centroids in 64-dimensional space, five routing variants achieve asymptotic perplexities that are statistically equivalent within a 1-PPL margin on WikiText-103, confirmed by TOST tests with p < 0.05 across all pairwise comparisons after 50K training steps. Hash, random-fixed, and top-1 routing degrade gracefully by only 1.1-2.2 PPL. A standard linear router with 5.3 times more routing parameters reaches 32.76 PPL, but matched cosine routing closes 67 percent of the gap, leaving a true mechanism advantage of about 1.2 percent. Multi-hop updates prove collinear (cosine similarity 0.805), acting as magnitude amplification rather than compositional reasoning.
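The equivalence claim rests on TOST rather than a mere failure to reject a difference. Below is a minimal sketch of how such a test operates on per-seed perplexities, assuming a 1-PPL equivalence margin and hypothetical PPL values; this is not the paper's analysis code.

    import numpy as np
    from scipy import stats

    def tost_equivalence(a, b, delta=1.0):
        # Two One-Sided Tests: are the means of a and b equivalent within +/- delta?
        a, b = np.asarray(a, float), np.asarray(b, float)
        diff = a.mean() - b.mean()
        se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))  # Welch-style standard error
        df = len(a) + len(b) - 2  # simple df; Welch-Satterthwaite would be more precise
        p_lower = 1.0 - stats.t.cdf((diff + delta) / se, df)  # H0: diff <= -delta
        p_upper = stats.t.cdf((diff - delta) / se, df)        # H0: diff >= +delta
        return diff, max(p_lower, p_upper)  # equivalence holds if the larger one-sided p < 0.05

    # Hypothetical 3-seed perplexities for two cosine-routing variants
    wide = [33.95, 34.10, 34.02]
    deep = [34.55, 34.62, 34.70]
    print(tost_equivalence(wide, deep, delta=1.0))  # small p => equivalent within 1 PPL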

What carries the argument

Cosine-similarity routing to learned centroids in a 64-dimensional space, which reduces routing parameters by 80 percent while allowing direct comparison of topology variants under controlled training.
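To make this mechanism concrete, here is a minimal PyTorch-style sketch of cosine-similarity routing to learned centroids; d_model, the expert count, and top_k are illustrative placeholders rather than the paper's exact configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CosineRouter(nn.Module):
        # Projects tokens into a low-dimensional routing space and scores experts by
        # cosine similarity to learned centroids, then keeps the top_k experts.
        def __init__(self, d_model=512, d_space=64, n_experts=12, top_k=2):
            super().__init__()
            self.proj = nn.Linear(d_model, d_space, bias=False)            # token -> routing coordinates
            self.centroids = nn.Parameter(torch.randn(n_experts, d_space)) # one learned centroid per expert
            self.top_k = top_k

        def forward(self, x):                                   # x: (batch, seq, d_model)
            coords = F.normalize(self.proj(x), dim=-1)          # unit-norm token coordinates
            cents = F.normalize(self.centroids, dim=-1)         # unit-norm centroids
            scores = coords @ cents.t()                         # cosine similarities (batch, seq, n_experts)
            gate_vals, expert_idx = scores.topk(self.top_k, dim=-1)
            gates = F.softmax(gate_vals, dim=-1)                # weights over the selected experts
            return gates, expert_idx

    router = CosineRouter()
    gates, expert_idx = router(torch.randn(2, 8, 512))
    print(gates.shape, expert_idx.shape)  # torch.Size([2, 8, 2]) for both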

If this is right

  • Multi-hop routing trajectories are largely collinear and implement magnitude amplification rather than compositional reasoning.
  • A single learnable scalar multiplier can replicate the performance of multi-hop updates.
  • Zero-shot relative-norm halting reduces MoE FLOPs by 25 percent at a cost of only a 0.12 percent PPL increase (see the sketch after this list).
  • Expert-level specialization and causal controllability can coexist with topology-level equifinality.
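The relative-norm halting rule in the bullet above (Eq. 7 in the paper's Figure 6 caption) can be sketched as follows for a single token vector; the expert_update function, the epsilon value, and the dimensions are placeholders for illustration.

    import torch

    def multi_hop_with_halting(x, expert_update, max_hops=4, eps=0.1):
        # Accumulate per-hop expert updates for one token, stopping early once the
        # relative update norm ||dh|| / (||x + h_accum|| + 1e-6) drops below eps.
        h_accum = torch.zeros_like(x)
        hops_used = 0
        for hop in range(max_hops):
            delta = expert_update(x + h_accum, hop)  # update proposed at this hop
            rel_norm = delta.norm() / ((x + h_accum).norm() + 1e-6)
            if rel_norm < eps:  # halting criterion: the trajectory has effectively converged
                break
            h_accum = h_accum + delta
            hops_used += 1
        return x + h_accum, hops_used

    # Toy update that shrinks geometrically, so halting triggers after one applied hop
    out, hops = multi_hop_with_halting(
        torch.randn(64),
        expert_update=lambda h, hop: h * (0.3 ** (hop + 1)),
    )
    print(hops)  # 1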

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • At this model scale the limiting factor for performance may be total capacity or training compute rather than routing sophistication.
  • Future designs could safely adopt simpler, lower-parameter routers without quality loss.
  • The result leaves open whether equifinality persists at much larger scales where routing efficiency might matter more.

Load-bearing premise

The 62 controlled experiments on WikiText-103 fully isolate routing topology effects from other training variables such as optimization dynamics, expert capacity, or initialization.

What would settle it

A replication at the same scale and training budget that finds a statistically significant perplexity gap larger than 1 point between any two cosine-routing variants would falsify the equifinality result.

Figures

Figures reproduced from arXiv: 2604.14419 by Ivan Ternovtsii, Yurii Bilak.

Figure 1
Figure 1. Figure 1: ST-MOE Architecture. (A) Full model: token embedding → N blocks → LM head. (B) Single block: pre-LN attention with RoPE, followed by pre-LN ST-MOE layer, both with residual connections. (C) Multi-hop routing: tokens are projected into a d_space-dimensional coordinate space, routed to top-K experts via cosine similarity with learned centroids, and accumulate expert updates across H hops with semantic positio… view at source ↗
Figure 2
Figure 2. Figure 2: Dense vs MoE Training Comparison. The iso-FLOP dense baseline (d_ff=12, same active FLOPs) tracks MoE throughout training; the iso-parameter dense baseline (d_ff=1120, ∼93× more active FLOPs) dominates after step 5K. MoE crosses the iso-FLOP dense at ∼10K steps, confirming that sparse routing provides value at matched active computation. view at source ↗
Figure 3
Figure 3. Figure 3: Convergence Marathon. Training curves for all configurations at 76M parameters, 50K steps (1.64B tokens). Among cosine-routing variants, all topologies converge to a statistically equivalent band (TOST, δ=1 PPL, all pairs p < 0.05; observed range 33.93–34.72 across 15 runs). The standard linear router (dashed cyan) falls outside this band at PPL 32.76, demonstrating that routing capacity matters even when … view at source ↗
Figure 4
Figure 4. Figure 4: Multi-Seed Validation. Training curves for Wide 1×12 and Deep 3×4 with seeds 42 and 137. Same-topology curves track nearly identically throughout training: Wide ∆=1.2%, Deep ∆=0.1%. All four runs converge within a 0.74 PPL band. Full 5-variant × 3-seed validation (15 runs total) confirms all topologies converge within a 0.79 PPL band (33.93–34.72), with inter-variant spread (0.60 PPL) only 5.0× the average… view at source ↗
Figure 5
Figure 5. Figure 5: Cross-Seed Expert Alignment. Left: vocabulary projection Jaccard similarity (best-match) is ∼500× above random—experts develop partially overlapping functional roles across seeds. Right: raw W_up cosine similarity is indistinguishable from random—these overlapping functions are realized through entirely different weight parameterizations. This divergence is the quantitative signature of equifinality. Cross… view at source ↗
Figure 6
Figure 6. Figure 6: Pareto Frontier. (a) MLP experts enable a smooth quality-compute tradeoff; static experts show a sharp cliff. (b) Halting response: MLP average hops decrease gradually with ε, while static experts exhibit a phase transition. After each hop, if the relative update norm falls below a threshold ε, i.e. ∥Δh∥ / (∥x + h_accum∥ + 10⁻⁶) < ε (Eq. 7), the token stops its trajectory early. No retraining is required—this exp… view at source ↗
Figure 2
Figure 2. Figure 2: The Depth Theorem: Nonlinear Experts Unlock Recursive Depth. MLP experts (solid) amplify the Deep vs Wide gap and scale better than static experts (dashed). view at source ↗
Figure 7
Figure 7. Figure 7: Depth advantage during underfitting. At 36M (a) and 76M (b), MLP experts amplify the Deep-vs-Wide gap. However, this advantage vanishes at convergence (Section 4.3): at 50K steps, Wide (PPL 33.93) beats Deep (PPL 34.62). view at source ↗
Figure 3
Figure 3. Figure 3: Scale Emergence: Navigational Diversity Grows with Model Size. HopDiv, displacement, and MLP advantage all increase at larger scale. view at source ↗
Figure 8
Figure 8. Figure 8: Scale Emergence. (a) Depth advantage grows with scale. (b) Navigational diversity (HopDiv) increases 7× from 36M to 76M. (c) Trajectory displacement grows. (d) MLP improvement compounds: −4.6% at 36M → −5.7% at 76M. view at source ↗
read the original abstract

Sparse Mixture-of-Experts (MoE) architectures employ increasingly sophisticated routing mechanisms -- learned routers, multi-hop trajectories, token-dependent gating. We ask: does routing topology actually determine language modeling quality? We build a geometric MoE (ST-MoE) using cosine-similarity routing against learned centroids in a low-dimensional space ($d_{space} = 64$), requiring 80% fewer routing parameters than standard linear routers. Through 62 controlled experiments on WikiText-103 at 76--84M parameters trained to convergence (50K steps, 1.64B tokens), we find that routing topology does not determine asymptotic perplexity (PPL): five cosine-routing variants are statistically equivalent within a 1-PPL margin (Two One-Sided Tests [TOST], $p < 0.05$ for all 10 pairwise comparisons; 15 runs across 3 seeds, observed range 33.93--34.72). The finding extends to hash, random-fixed, and top-1 routing (single-seed; graceful 1.1--2.2 PPL degradation) and replicates on OpenWebText (0.03 PPL gap, 6 runs, 3 seeds each). A standard linear router with 5.3$\times$ more routing parameters reaches PPL 32.76, but iso-parameter cosine routing closes 67% of this gap -- the true mechanism advantage is $\sim$1.2%. The mechanistic explanation is convergent redundancy: multi-hop updates are collinear ($\cos(\Delta h_0, \Delta h_1) = 0.805$), implementing magnitude amplification rather than compositional reasoning; a single learnable scalar replicates multi-hop performance. As a practical payoff, zero-shot relative-norm halting saves 25% of MoE FLOPs at +0.12% PPL. Expert-level specialization and causal controllability -- which coexist with topology-level equifinality -- are explored in a companion paper.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that routing topology in sparse Mixture-of-Experts (MoE) models does not determine asymptotic language modeling quality. Using a geometric ST-MoE with cosine-similarity routing against learned centroids in a 64-dimensional space (80% fewer routing parameters than linear routers), the authors report 62 controlled experiments on WikiText-103 (76-84M parameters, 50K steps to 1.64B tokens) showing five cosine-routing variants are statistically equivalent within a 1-PPL margin via TOST tests (p<0.05 for all 10 pairs, 15 runs across 3 seeds, PPL range 33.93-34.72). The result extends to hash/random/top-1 routing (minor degradation) and replicates on OpenWebText; a linear router is only ~1.2% better after iso-parameter adjustment. They attribute equifinality to convergent redundancy (multi-hop updates collinear with cos=0.805) and demonstrate 25% FLOP savings via zero-shot relative-norm halting.

Significance. If the equifinality result holds, it would be significant for MoE research by shifting emphasis away from routing sophistication toward expert specialization and other factors, while highlighting practical gains in parameter efficiency and compute. The work earns credit for its scale (62 experiments to convergence, TOST equivalence testing, 15-run seed replication, OpenWebText replication, and mechanistic collinearity measurement), which provides stronger empirical grounding than typical MoE ablations. The finding that topology-level differences yield negligible PPL impact (with graceful degradation for simpler routers) is falsifiable and could influence design choices at this scale.

major comments (2)
  1. [Experimental design (results and methods sections)] The central claim that PPL equivalence is attributable to routing topology requires explicit isolation from optimization dynamics. The manuscript does not report expert utilization histograms, load-balancing statistics, or per-variant gradient norm distributions, even though different routing functions alter token-to-expert assignments and thus effective batch composition at the 76-84M scale. The narrow observed PPL band (33.93-34.72) means modest confounding in convergence behavior could produce the reported TOST equivalence without proving topology independence. (A sketch of such utilization diagnostics follows the minor comments below.)
  2. [§ on mechanistic explanation] The collinearity measurement (cos(Δh0, Δh1)=0.805) is presented as explanatory but is post-hoc; the manuscript should clarify whether this was measured on the same runs used for the main equivalence claim or on a separate set, and whether a single scalar multiplier was ablated against the full multi-hop router in the same controlled setup.
minor comments (2)
  1. [Abstract and results] Clarify in the abstract and results whether the 33.93-34.72 PPL range is the min-max across all 62 runs or the per-variant means; also state the exact TOST equivalence margin and power calculation.
  2. [Methods] The claim of '80% fewer routing parameters' should include the exact parameter count for the linear baseline versus the cosine router (including centroid storage) to allow direct verification.
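The utilization and load-balancing diagnostics requested in major comment 1 are cheap to compute from logged routing assignments. A minimal sketch, assuming assignments are available as a (tokens, top_k) tensor of expert indices; the tensor below is synthetic, not data from the paper.

    import torch

    def expert_utilization(expert_idx, n_experts):
        # Per-expert token counts, utilization fractions, and a max/mean imbalance
        # metric (1.0 means perfectly balanced load across experts).
        counts = torch.bincount(expert_idx.flatten(), minlength=n_experts).float()
        frac = counts / counts.sum()
        imbalance = counts.max() / counts.mean()
        return counts, frac, imbalance

    # Synthetic top-2 assignments for 10,000 tokens routed over 12 experts
    idx = torch.randint(0, 12, (10_000, 2))
    counts, frac, imbalance = expert_utilization(idx, n_experts=12)
    print(counts.tolist(), round(imbalance.item(), 3))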

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important aspects of experimental rigor and mechanistic clarity. We address each major comment below, agreeing where revisions are warranted to better isolate the effects of routing topology and to clarify the supporting analyses.

read point-by-point responses
  1. Referee: Experimental design (results and methods sections): The central claim that PPL equivalence is attributable to routing topology requires explicit isolation from optimization dynamics. The manuscript does not report expert utilization histograms, load-balancing statistics, or per-variant gradient norm distributions, even though different routing functions alter token-to-expert assignments and thus effective batch composition at the 76-84M scale. The narrow observed PPL band (33.93-34.72) means modest confounding in convergence behavior could produce the reported TOST equivalence without proving topology independence.

    Authors: We agree that reporting these diagnostics would strengthen the claim by more explicitly ruling out optimization confounds. Although the 15-run, multi-seed design and TOST tests already provide statistical safeguards against minor convergence differences, we will add expert utilization histograms, load-balancing statistics (including per-expert token counts and imbalance metrics), and per-variant gradient norm distributions to the revised results and methods sections. These will be computed from the existing training logs to demonstrate that token-to-expert assignment differences do not produce systematically divergent optimization trajectories within the observed PPL range. revision: yes

  2. Referee: § on mechanistic explanation: The collinearity measurement (cos(Δh0, Δh1)=0.805) is presented as explanatory but is post-hoc; the manuscript should clarify whether this was measured on the same runs used for the main equivalence claim or on a separate set, and whether a single scalar multiplier was ablated against the full multi-hop router in the same controlled setup.

    Authors: The collinearity measurement (cos(Δh0, Δh1)=0.805) was obtained from the identical set of runs underlying the main equivalence results to maintain experimental consistency. The ablation demonstrating that a single learnable scalar multiplier replicates multi-hop performance was likewise performed within the same controlled 76-84M parameter setup on WikiText-103. We will revise the mechanistic explanation section to explicitly state these details, including the exact experimental conditions and that the scalar ablation used the same training hyperparameters and seeds. revision: yes
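For concreteness, the collinearity statistic discussed in this exchange is a per-token cosine between consecutive hop updates, averaged over tokens. A minimal sketch follows, assuming the hop updates are available as logged tensors; the random stand-ins below are merely scaled so the statistic lands near the reported 0.805.

    import torch
    import torch.nn.functional as F

    def mean_hop_collinearity(delta_h0, delta_h1):
        # Average per-token cosine similarity between first- and second-hop updates.
        return F.cosine_similarity(delta_h0, delta_h1, dim=-1).mean()

    # Stand-in updates: hop 1 is mostly parallel to hop 0 plus noise, scaled so the
    # average cosine comes out near 0.8 (the paper reports 0.805 on real runs)
    h0 = torch.randn(4096, 512)
    h1 = 0.6 * h0 + 0.45 * torch.randn(4096, 512)
    print(mean_hop_collinearity(h0, h1).item())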

Circularity Check

0 steps flagged

No significant circularity: central claim rests on direct experimental measurements

full rationale

The paper's load-bearing claim—that routing topology does not determine asymptotic perplexity—is established through 62 controlled training runs on WikiText-103 (and replications on OpenWebText), with TOST statistical equivalence tests applied to observed PPL values across variants. No mathematical derivation chain, equations, or first-principles results are presented that reduce to their own inputs by construction. The collinearity observation (cos=0.805) and single-scalar replication are reported as post-hoc measurements from the same runs, not as fitted parameters renamed as predictions. No self-citation load-bearing steps, uniqueness theorems, or ansatzes imported via citation appear in the provided text. The work is self-contained against external benchmarks (fixed datasets, fixed training budgets) and does not rely on internal definitions that presuppose the target equivalence.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests primarily on empirical training outcomes and statistical tests rather than theoretical axioms or new postulated entities.

free parameters (1)
  • d_space = 64
    Dimension of the routing centroid space set to 64 to achieve parameter reduction.
axioms (1)
  • domain assumption: 50K training steps on 1.64B tokens reaches asymptotic perplexity for 76-84M parameter models
    Invoked to treat final PPL as converged performance.

pith-pipeline@v0.9.0 · 5669 in / 1255 out tokens · 32542 ms · 2026-05-10T12:58:56.698609+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Mixture of Layers with Hybrid Attention

    cs.LG 2026-05 unverdicted novelty 7.0

    Mixture of Layers replaces monolithic transformer blocks with routed thin parallel blocks using hybrid attention that combines a shared softmax block for global context with Gated DeltaNet linear attention in the rout...

Reference graph

Works this paper leans on

40 extracted references · 18 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    Geometric routing enables causal expert control in mixture of experts. arXiv preprint, 2026

    Ivan Ternovtsii and Yurii Bilak. Geometric routing enables causal expert control in mixture of experts. arXiv preprint, 2026

  2. [2]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017

  3. [3]

    GShard: Scaling giant models with conditional computation and automatic sharding

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations, 2021

  4. [4]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022

  5. [5]

    ST-MoE: Designing Stable and Transferable Sparse Expert Models

    Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. ST-MoE: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022

  6. [6]

    Janus: A unified framework for evaluating and training sparse expert models. arXiv preprint, 2023

    Zirui Liu et al. Janus: A unified framework for evaluating and training sparse expert models. arXiv preprint, 2023

  7. [7]

    RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  8. [8]

    Using the output embedding to improve language models

    Ofir Press and Lior Wolf. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, 2017

  9. [9]

    Sigmoid loss for language image pre-training, 2023

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. arXiv preprint arXiv:2303.15343, 2023

  10. [10]

    Pointer Sentinel Mixture Models

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2017

  11. [11]

    Training compute-optimal large language models. Advances in Neural Information Processing Systems, 35, 2022

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. Advances in Neural Information Processing Systems, 35, 2022

  12. [12]

    Scaling laws for fine-grained mixture of experts

    Jakub Ludziejewski, Jakub Krajewski, Kamil Adamczewski, Sebastian Jaszczur, Szymon Nowak, and Piotr Sankowski. Scaling laws for fine-grained mixture of experts. In International Conference on Machine Learning (ICML), 2024

  13. [13]

    Donald J Schuirmann. A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics, 15(6):657–680, 1987

  14. [14]

    Deep networks with stochastic depth

    Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, 2016

  15. [15]

    Mixture of layers: Decomposing MoE transformers into parallel thin blocks

    Ivan Ternovtsii and Yurii Bilak. Mixture of layers: Decomposing MoE transformers into parallel thin blocks. Manuscript in preparation, 2026

  16. [16]

    Routing matters in MoE: Scaling diffusion transformers with explicit routing guidance. arXiv preprint arXiv:2510.24711, 2025

    Alibaba Research. Routing matters in MoE: Scaling diffusion transformers with explicit routing guidance. arXiv preprint arXiv:2510.24711, 2025. ICLR 2026

  17. [17]

    Chain-of-experts: When LLMs meet complex operations research problems. arXiv preprint arXiv:2501.07218, 2025

    Ziyang Xiao et al. Chain-of-experts: When LLMs meet complex operations research problems. arXiv preprint arXiv:2501.07218, 2025

  18. [18]

    RoMA: Routing manifold alignment improves generalization of mixture-of-experts LLMs

    Tianyi Zhou et al. RoMA: Routing manifold alignment improves generalization of mixture-of-experts LLMs. arXiv preprint arXiv:2511.07419, 2025

  19. [19]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024

  20. [20]

    OLMoE: Open mixture-of-experts language models

    Niklas Muennighoff et al. OLMoE: Open mixture-of-experts language models. In Proceedings of the International Conference on Learning Representations, 2025

  21. [21]

    Towards a comprehensive scaling law of mixture-of-experts

    Guoliang Zhao, Yuhan Fu, Shuaipeng Li, et al. Towards a comprehensive scaling law of mixture-of-experts. arXiv preprint arXiv:2509.23678, 2025

  22. [22]

    Towards greater leverage: Scaling laws for efficient mixture-of-experts language models

    Changxin Tian, Kunlong Chen, Jia Liu, Ziqi Liu, Zhiqiang Zhang, and Jun Zhou. Towards greater leverage: Scaling laws for efficient mixture-of-experts language models. arXiv preprint arXiv:2507.17702, 2025

  23. [23]

    MoE lens – an expert is all you need. arXiv preprint arXiv:2603.05806, 2026

    Marmik Chaudhari, Idhant Gulati, Nishkal Hundia, Pranav Karra, and Shivam Raval. MoE lens – an expert is all you need. arXiv preprint arXiv:2603.05806, 2026

  24. [24]

    BuddyMoE: Exploiting expert redundancy to accelerate memory-constrained mixture-of-experts inference

    Yun Wang, Lingyun Yang, Senhao Yu, Yixiao Wang, Ruixing Li, Zhixiang Wei, James Yen, and Zhengwei Qi. BuddyMoE: Exploiting expert redundancy to accelerate memory-constrained mixture-of-experts inference. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2026

  25. [25]

    Hash layers for large sparse models

    Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, and Jason Weston. Hash layers for large sparse models. Advances in Neural Information Processing Systems, 2021

  26. [26]

    DEMix layers: Disentangling domains for modular language modeling

    Suchin Gururangan, Mike Lewis, Anirudh Srivastava, Veselin Stoyanov, and Luke Zettlemoyer. DEMix layers: Disentangling domains for modular language modeling. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics, 2022

  27. [27]

    Mixture of a million experts

    Xu Owen He. Mixture of a million experts. arXiv preprint arXiv:2407.04153, 2024

  28. [28]

    ReMoE: Fully differentiable mixture-of-experts with ReLU routing

    Ziteng Wang, Jun Zhu, and Jianfei Chen. ReMoE: Fully differentiable mixture-of-experts with ReLU routing. In Proceedings of the International Conference on Learning Representations, 2025

  29. [29]

    Layerwise recurrent router for mixture-of-experts

    Zihan Qiu, Zeyu Huang, Shuang Cheng, Yizhi Zhou, Zili Wang, Ivan Titov, and Jie Fu. Layerwise recurrent router for mixture-of-experts. In Proceedings of the International Conference on Learning Representations, 2025

  30. [30]

    Statistical advantages of perturbing cosine router in mixture of experts

    Huy Nguyen, Pedram Akbarian, Trang Pham, Trang Nguyen, Shujian Zhang, and Nhat Ho. Statistical advantages of perturbing cosine router in mixture of experts. In Proceedings of the International Conference on Learning Representations, 2025

  31. [31]

    DirMoE: Dirichlet-routed mixture of experts. arXiv preprint arXiv:2602.09001, 2026

    Amirhossein Vahidi, Hesam Asadollahzadeh, Navid Akhavan Attar, Marie Moullet, Kevin Ly, Xingyi Yang, and Mohammad Lotfollahi. DirMoE: Dirichlet-routed mixture of experts. arXiv preprint arXiv:2602.09001, 2026

  32. [32]

    Grouter: Decoupling routing from representation for accelerated MoE training

    Yuqi Xu, Rizhen Hu, Zihan Liu, Mou Sun, and Kun Yuan. Grouter: Decoupling routing from representation for accelerated MoE training. arXiv preprint arXiv:2603.06626, 2026

  33. [33]

    Mixture-of-depths: Dynamically allocating compute in transformer-based language models

    David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, and Adam Santoro. Mixture-of-depths: Dynamically allocating compute in transformer-based language models. arXiv preprint arXiv:2404.02258, 2024

  34. [34]

    Universal transformers

    Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. In International Conference on Learning Representations, 2019

  35. [35]

    Routing absorption in sparse attention: Why random gates are hard to beat. arXiv preprint arXiv:2603.02227, 2026

    Keston Aquino-Michaels. Routing absorption in sparse attention: Why random gates are hard to beat. arXiv preprint arXiv:2603.02227, 2026

  36. [36]

    On linear mode connectivity of mixture-of-experts architectures

    Viet-Hoang Tran, Van Hoan Trinh, Khanh Vinh Bui, and Tan M. Nguyen. On linear mode connectivity of mixture-of-experts architectures. In Advances in Neural Information Processing Systems, 2025

  37. [37]

    The illusion of specialization: Unveiling the domain-invariant “standing committee” in mixture-of-experts models

    Yan Wang, Yitao Xu, Nanhan Shen, Jinyan Su, Jimin Huang, and Zining Zhu. The illusion of specialization: Unveiling the domain-invariant “standing committee” in mixture-of-experts models. arXiv preprint arXiv:2601.03425, 2026

  38. [38]

    SD-MoE: Spectral decomposition for effective expert specialization

    Researchers from Fudan, Tsinghua, Michigan, CMU. SD-MoE: Spectral decomposition for effective expert specialization. arXiv preprint arXiv:2602.12556, 2026

  39. [39]

    Understanding cross-layer contributions to MoE routing in LLMs. arXiv preprint, 2026

    Wengang Li, Lingqi Zhang, Toshio Endo, and Mohamed Wahib. Understanding cross-layer contributions to MoE routing in LLMs. arXiv preprint, 2026

  40. [40]

    Advancing expert specialization for better MoE

    Hongcan Guo, Haolang Lu, et al. Advancing expert specialization for better MoE. In Advances in Neural Information Processing Systems, 2025