pith. machine review for the scientific record.

arxiv: 2604.09780 · v1 · submitted 2026-04-10 · 💻 cs.AI

Recognition: 3 theorem links

· Lean Theorem

The Myth of Expert Specialization in MoEs: Why Routing Reflects Geometry, Not Necessarily Domain Expertise

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 17:49 UTC · model grok-4.3

classification 💻 cs.AI
keywords mixture of experts · expert specialization · routing mechanisms · hidden state geometry · load balancing loss · large language models · model interpretability

The pith

Because MoE routers are linear, expert usage similarity is fully determined by hidden-state similarity, making apparent specialization an emergent property of the model's representation geometry rather than any domain-specific expertise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that linear routers turn expert routing decisions into a direct function of how similar two inputs are in the model's hidden representation space, so that what looks like expert specialization simply tracks the geometry already present in those representations. A reader should care because this account implies that designing or interpreting MoEs around the idea of domain-specialized experts rests on a misunderstanding; any observed specialization is inherited from the upstream representation learning rather than created by the routing layer itself. The authors verify the necessity and sufficiency of hidden-state similarity at both token and sequence scales across five pre-trained models and prove that load-balancing losses actively suppress shared directions in hidden space to preserve routing diversity. They further show that real specialization patterns in these models remain largely uninterpretable by humans, with expert overlap between different models answering the same question no higher than between entirely different questions.

Core claim

Since MoE routers are linear maps, hidden-state similarity is both necessary and sufficient to explain expert-usage similarity, and specialization is therefore an emergent property of the representation space, not of the routing architecture itself. This relation holds at token and sequence levels in five pre-trained models. The load-balancing loss is shown to suppress shared hidden-state directions in order to maintain routing diversity, offering a possible explanation for collapse under low-diversity training regimes such as small batches. Despite the clean mechanistic story, specialization patterns resist human interpretation: expert overlap across models on the same question is no higher than between entirely different questions (roughly 60%).

What carries the argument

Linear router maps that equate expert-activation similarity with hidden-state similarity via dot-product geometry.
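A minimal numerical sketch of this mechanism (not code from the paper): because the router is a linear map, the gap between two inputs' logits is bounded by the hidden-state gap scaled by the router's largest singular value, so nearby hidden states get nearby logits and, typically, the same top-k experts. Dimensions, seed, and weights below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, E, k = 64, 8, 2                 # hidden size, number of experts, top-k

W = rng.standard_normal((E, d)) / np.sqrt(d)   # linear router: logits = W @ h

h1 = rng.standard_normal(d)
h2 = h1 + 0.01 * rng.standard_normal(d)        # a nearby hidden state

logits1, logits2 = W @ h1, W @ h2

# Sufficiency direction: ||W h1 - W h2|| <= sigma_max(W) * ||h1 - h2||
lhs = np.linalg.norm(logits1 - logits2)
rhs = np.linalg.svd(W, compute_uv=False)[0] * np.linalg.norm(h1 - h2)

# Top-k expert sets chosen from the logits
top1 = set(np.argsort(logits1)[-k:])
top2 = set(np.argsort(logits2)[-k:])
print(lhs <= rhs, top1, top2)
```

The bound is exactly the dot-product-geometry point: nothing in the router can separate inputs more than their hidden-state geometry already does.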

If this is right

  • Load-balancing loss preserves routing diversity by actively suppressing directions that would otherwise be shared across many hidden states.
  • Specialization patterns are expected to collapse when training data lack diversity, such as in small-batch regimes.
  • Prompt-level routing decisions do not reliably predict the routing observed over a full generated sequence.
  • Deeper layers tend to activate nearly identical experts even for semantically unrelated inputs, especially in reasoning-oriented models.
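The first bullet's diversity pressure can be illustrated with a common auxiliary loss; the paper's exact formulation is not reproduced here, so this uses the Switch-Transformer-style loss L = E · Σᵢ fᵢ·Pᵢ as a stand-in. It is minimized when expert usage is uniform, and it grows when one shared hidden-state direction dominates routing.

```python
import numpy as np

def load_balance_loss(logits, k=1):
    """Switch-style auxiliary loss: E * sum_i f_i * P_i, where f_i is the
    fraction of tokens routed to expert i and P_i the mean router probability.
    Near 1.0 when routing is balanced; larger when load concentrates."""
    E = logits.shape[1]
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    assignments = np.argsort(logits, axis=1)[:, -k:]
    f = np.bincount(assignments.ravel(), minlength=E) / (k * logits.shape[0])
    P = probs.mean(axis=0)
    return E * float(f @ P)

rng = np.random.default_rng(0)
T, E = 4096, 8

diverse = rng.standard_normal((T, E))            # varied directions -> varied logits
collapsed = rng.standard_normal((1, E)) + 0.01 * rng.standard_normal((T, E))
# one shared direction dominates -> every token routes almost identically

loss_diverse = load_balance_loss(diverse)
loss_collapsed = load_balance_loss(collapsed)
print(loss_diverse, loss_collapsed)
```

Gradient descent on this loss therefore pushes the model away from hidden states that all share one dominant direction, which is the suppression effect the bullet describes.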

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Interpretability work on MoEs may gain more by intervening on representation geometry than by redesigning the router itself.
  • The same linear-map argument could extend to other gating or routing components whose weights act as linear classifiers over hidden states.
  • Scaling laws for MoE efficiency may be separable from any gains in human-interpretable specialization, since the latter depends on the still-open problem of hidden-state geometry.

Load-bearing premise

The linearity of the router together with the geometry of hidden states is sufficient to explain routing behavior without major interference from training dynamics or other non-linear components in the network.

What would settle it

Finding pairs of inputs whose hidden states are highly similar yet receive substantially different expert assignments (or, conversely, pairs with dissimilar hidden states that route identically for reasons beyond the router's null space) in any model whose router is a linear map would falsify the central claim.
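The falsifier can be phrased as a concrete scan over hidden-state pairs: flag any pair that is close in hidden space yet routed to different top-k expert sets. The random data below is a placeholder for real hidden states and router weights; the function names and tolerance are illustrative choices, not the paper's.

```python
import numpy as np

def falsifying_pairs(H, W, k=2, dist_tol=0.05):
    """Return index pairs whose hidden states are close (relative distance
    below dist_tol) yet whose top-k expert sets differ.
    H: (T, d) hidden states; W: (E, d) router weights.
    Any such pair would contradict the sufficiency direction of the claim."""
    logits = H @ W.T
    topk = np.argsort(logits, axis=1)[:, -k:]
    hits = []
    T = H.shape[0]
    for i in range(T):
        for j in range(i + 1, T):
            rel = np.linalg.norm(H[i] - H[j]) / max(
                np.linalg.norm(H[i]), np.linalg.norm(H[j]))
            if rel < dist_tol and set(topk[i]) != set(topk[j]):
                hits.append((i, j))
    return hits

rng = np.random.default_rng(0)
H = rng.standard_normal((200, 64))   # stand-in for real hidden states
W = rng.standard_normal((8, 64))     # stand-in for a real router
hits = falsifying_pairs(H, W)
print(len(hits))
```

On independent Gaussian states no pair is ever that close, so the scan returns nothing; run against real model activations, a non-empty result would be the evidence this section asks for.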

Figures

Figures reproduced from arXiv: 2604.09780 by Eric Nalisnick, Soufiane Hayou, Xi Wang.

Figure 1
Figure 1: Empirical verification of Prop. 1. On the middle layer of 5 models, using 16 sequences from OpenWebText, we compare the RMS distance (Eq. (5)) of all token pairs' hidden states vs. their router logits. Bottom left corners: across all models, similar hidden states induce similar router logits, i.e. similar expert usage. Middle right: when the hidden states are different, the router logits can be similar or … view at source ↗
Figure 2
Figure 2: On gpt-oss-20b and Trinity-Mini-Base, we studied tokens in 16 random sequences from OpenWebText. Lines / bars denote the mean / 1-std. over 16 sequences. The two models use different training configurations, resulting in distinct patterns across depth. Top left: token hidden states become more correlated with depth, indicated by stronger energy (Eq. (7)) from the shared direction (first right singular vector). … view at source ↗
Figure 3
Figure 3: We shuffled and reversed 16 sequences from OpenWebText to simulate OOD inputs. Router outputs show less confidence (measured by maximum softmax probability, Eq. (12)) on OOD inputs compared with the original ones (green line), due to reduced router-data alignment (Eq. (4)). view at source ↗
Figure 5
Figure 5: Pairwise comparisons of inputs drawn from Nemotron-CC-Math-v1, OpenWebText, and across these two datasets, plotting pooled hidden-state similarity (Eq. (17)) against expert-usage frequency similarity (Eq. (14)). Consistent with the token-level findings. view at source ↗
Figure 4
Figure 4: Same token (top) / sentence (bottom), different context: expert usage is context-dependent in deep layers. The preceding analysis examined expert usage at the token level, showing that tokens with similar hidden states tend to route to similar experts. We now extend this view to the sequence level by comparing the cosine similarity of pooled hidden states (Eq. (17)) against the cosine similarity of se… view at source ↗
Figure 6
Figure 6: Expert overlap (Jaccard similarity at Top-p=0.8, Eq. (19)) between rollout pairs on 30 HMMT February 2025 questions, averaged over 4 seeds and three MoEs. Different models solving the same question (orange) share only ∼60% of their most-used experts, on par with the same model solving different questions (blue), and well below the same-question, same-model baseline (green). The gap is consistent across all… view at source ↗
Figure 7
Figure 7: Expert-usage similarity between paired queries on Ling-mini, tracked from prompt through rollout. The dashed line marks the prompt boundary. Across all three query pairs, prompt-phase similarity is near-identical; once generation begins, similarity either holds or drops sharply depending on the query and is not predictable from the prompt alone. Understanding expert specialization is at least as hard as un… view at source ↗
Figure 8
Figure 8: On gpt-oss-20b, we keep the 12 most-used experts (out of 32) in the last k layers, using usage frequency from a single hand-written paragraph. First four columns: pruning up to 2 layers has a negligible effect (<10% NLL increase) on prompt-only inputs, indicating router collapse during prefilling. Last two columns: adding completions makes the pruned experts consequential, with NLL increasing substantially… view at source ↗
Figure 9
Figure 9: Per-expert activation frequencies for an HLE question (orange, 411 tokens) and a Wimbledon Wikipedia passage (purple, 2500 tokens) across four layers and five generation lengths for answering the HLE question. During prefilling, the two unrelated sequences activate nearly identical experts in deeper layers. As the generation proceeds, their activation profiles diverge, with the split emerging earlier in l… view at source ↗
Figure 10
Figure 10: Cross-domain expert-usage similarity under increasing input duplication (1× to 12×) across models; dashed grey shows intra-domain similarity. For models except gpt-oss with chat template (10c), hidden-state similarity alone does not produce router collapse under no duplication. Duplication amplifies the dominant hidden-state direction, progressively collapsing cross-domain expert usage toward the intra-do… view at source ↗
Figure 11
Figure 11: Same model architecture, different optimizers induce different feature-space properties. At the same layer, given the same sequence, both models show a similar distribution of pairwise cosine similarity (over all tokens). However, embeddings in the model trained with AdamW show, on average, a higher RMS distance, indicating that different tokens exhibit high variation in norm (25 - 35). In contrast, Muon … view at source ↗
Figure 12
Figure 12: On dense models, except for Gemma3 (which uses sandwich norm), token similari… view at source ↗
Figure 13
Figure 13: On DeepSeek V2 Lite, empirical comparison of the naive bound … view at source ↗
Figure 14
Figure 14: Hidden-state similarity (measured by Eq. …) view at source ↗
Figure 15
Figure 15: Similar to Fig. … view at source ↗
Figure 16
Figure 16: gpt-oss-20b (w/ auxiliary loss): router-data alignment vs. a random baseline and singular-value spectra across layers. view at source ↗
Figure 17
Figure 17: Trinity Mini Base (Muon, no auxiliary loss). view at source ↗
Figure 18
Figure 18: DeepSeek V2 Lite (AdamW, w/ auxiliary loss). view at source ↗
Figure 19
Figure 19: Moonlight 16B A3B, same architecture as DeepSeek V2 Lite, trained with Muon. view at source ↗
Figure 20
Figure 20: gpt-oss-20b. view at source ↗
Figure 21
Figure 21: Ernie 4.5 21B-A3B-Base. view at source ↗
Figure 22
Figure 22: Qwen3-30B-A3B. view at source ↗
Figure 23
Figure 23: OOD inputs reduce router-data alignment. view at source ↗
Figure 24
Figure 24: Keeping only top-K directions (VK) of the data (i.e. the post-attention hidden states, denoted by ht), we study the ratio of expert-usage agreement compared with the original inputs, averaged over 50,000 tokens from nvidia/Nemotron-CC-Math-v1 at the 10th layer of each model. The expert receiving the largest logits (brown line, m = 1) shows high agreement over all tokens. Lower-rank experts (e.g. the 6th e… view at source ↗
Figure 25
Figure 25: Similar to Fig. … view at source ↗
Figure 26
Figure 26: On Ernie-4.5, duplicating sequences increases token hidden-state similarity … view at source ↗
Figure 27
Figure 27: Comparison of expert usage under duplication for gpt-oss-20b with (bottom) and … view at source ↗
Figure 28
Figure 28: Comparison of expert usage under duplication for ERNIE-4.5-21B-A3B with … view at source ↗
Figure 29
Figure 29: Duplication increases logit magnitude, increasing model confidence and causing … view at source ↗
read the original abstract

Mixture of Experts (MoEs) are now ubiquitous in large language models, yet the mechanisms behind their "expert specialization" remain poorly understood. We show that, since MoE routers are linear maps, hidden state similarity is both necessary and sufficient to explain expert usage similarity, and specialization is therefore an emergent property of the representation space, not of the routing architecture itself. We confirm this at both token and sequence level across five pre-trained models. We additionally prove that load-balancing loss suppresses shared hidden state directions to maintain routing diversity, which might provide a theoretical explanation for specialization collapse under less diverse data, e.g. small batch. Despite this clean mechanistic account, we find that specialization patterns in pre-trained MoEs resist human interpretation: expert overlap between different models answering the same question is no higher than between entirely different questions ($\sim$60\%); prompt-level routing does not predict rollout-level routing; and deeper layers exhibit near-identical expert activation across semantically unrelated inputs, especially in reasoning models. We conclude that, while the efficiency perspective of MoEs is well understood, understanding expert specialization is at least as hard as understanding LLM hidden state geometry, a long-standing open problem in the literature.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript argues that because MoE routers are linear maps, hidden-state similarity is both necessary and sufficient to explain expert-usage similarity, making apparent expert specialization an emergent property of the learned representation geometry rather than the routing architecture. This is supported by token- and sequence-level empirical checks across five pre-trained models, a proof that load-balancing losses suppress shared hidden-state directions to preserve routing diversity, and negative results showing that specialization patterns resist human interpretation (expert overlap of ~60% across models or questions, poor prompt-to-rollout prediction, and near-identical deep-layer activations across unrelated inputs).

Significance. If the central equivalence holds after clarification, the work supplies a mechanistic account of MoE routing grounded in linearity and representation geometry, together with a theoretical explanation for specialization collapse under low-diversity regimes. The multi-model empirical validation and the load-balancing proof constitute clear strengths that could redirect research from architectural tweaks toward understanding hidden-state geometry.

major comments (1)
  1. Abstract and central theoretical claim: the assertion that hidden-state similarity is 'necessary' (as well as sufficient) for expert-usage similarity does not hold in the full space. With router logits = W h where W ∈ ℝ^{E×d} and E ≪ d, W h1 ≈ W h2 (hence similar top-k expert assignments) is equivalent only to (h1 − h2) lying in ker(W), whose dimension is at least d − E ≫ 0. Thus two hidden states can be arbitrarily dissimilar in Euclidean distance yet produce identical router outputs. Expert-usage similarity therefore does not entail full hidden-state similarity, only similarity in the row space of W. This gap directly undermines the claim that specialization is 'not of the routing architecture itself' but purely emergent from representation-space geometry. The sufficiency direction and the load-balancing proof appear independent of this issue.
minor comments (2)
  1. Empirical sections: the reported confirmation across five models would be strengthened by explicit statements of data exclusion criteria, the precise statistical tests underlying the ~60% overlap figures, and the full derivation of the load-balancing result so that readers can verify the necessity/sufficiency claims.
  2. Notation: clarify throughout whether 'hidden state similarity' refers to full-space Euclidean distance or to similarity after projection onto the router subspace; the distinction is load-bearing for the necessity direction.
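The referee's kernel argument in major comment 1 is easy to make concrete (a sketch with illustrative dimensions, not code from the paper): adding to h1 any vector in ker(W) leaves the logits, and hence the routing, unchanged, while the Euclidean distance between the two hidden states can be made arbitrarily large.

```python
import numpy as np

rng = np.random.default_rng(0)
E, d = 8, 64                     # E << d, so dim ker(W) >= d - E = 56

W = rng.standard_normal((E, d))  # router logits = W @ h
h1 = rng.standard_normal(d)

# Orthonormal basis of the null space of W: the last d - E right singular
# vectors (W is full row rank almost surely for random Gaussian weights).
_, _, Vt = np.linalg.svd(W)
null_basis = Vt[E:]              # (d - E, d), each row satisfies W @ v ≈ 0

v = 100.0 * null_basis[0]        # a huge displacement invisible to the router
h2 = h1 + v

print(np.linalg.norm(h1 - h2))            # very dissimilar hidden states
print(np.linalg.norm(W @ h1 - W @ h2))    # yet numerically identical logits
```

This is exactly the counterexample to the necessity direction: expert-usage similarity constrains only the component of the hidden state in the row space of W.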

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and precise feedback on the central claim. We agree that the necessity direction requires clarification and will revise the manuscript to address this point directly.

read point-by-point responses
  1. Referee: Abstract and central theoretical claim: the assertion that hidden-state similarity is 'necessary' (as well as sufficient) for expert-usage similarity does not hold in the full space. With router logits = W h where W ∈ ℝ^{E×d} and E ≪ d, W h1 ≈ W h2 (hence similar top-k expert assignments) is equivalent only to (h1 − h2) lying in ker(W), whose dimension is at least d − E ≫ 0. Thus two hidden states can be arbitrarily dissimilar in Euclidean distance yet produce identical router outputs. Expert-usage similarity therefore does not entail full hidden-state similarity, only similarity in the row space of W. This gap directly undermines the claim that specialization is 'not of the routing architecture itself' but purely emergent from representation-space geometry. The sufficiency direction and the load-balancing proof appear independent of this issue.

    Authors: We agree with the referee's observation. Hidden-state similarity in the full Euclidean space is sufficient for similar router outputs but not necessary, as states differing only in ker(W) produce identical logits. We will revise the abstract, introduction, and theoretical discussion to state that expert-usage similarity is governed by similarity of the hidden states' projections onto the row space of W. This does not change the core conclusion: because the router is strictly linear, routing decisions are fully determined by the geometry of the learned hidden states as seen by W, rather than by any non-linear or domain-expertise mechanism internal to the routing architecture. The empirical results across models and the load-balancing proof remain independent of this clarification and continue to support the geometric account. revision: yes
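The rebuttal's revised statement, that routing is governed by the hidden state's projection onto the row space of W, can be checked directly. A sketch under the same linear-router assumption, with illustrative dimensions; P here is the standard orthogonal projector onto the row space, not a symbol from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
E, d = 8, 64

W = rng.standard_normal((E, d))  # router logits = W @ h
h = rng.standard_normal(d)

# Orthogonal projector onto the row space of W (full row rank a.s.):
# P = W^T (W W^T)^{-1} W, so W @ P = W exactly.
P = W.T @ np.linalg.inv(W @ W.T) @ W

# The router cannot distinguish h from its row-space projection P @ h,
# so routing depends only on the projected geometry.
print(np.allclose(W @ h, W @ (P @ h)))
```

This makes the authors' amended claim precise: the "geometry as seen by W" is literally the image of the hidden states under this projector.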

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's central claim follows directly from the architectural fact that MoE routers are linear maps (logits = W h), yielding a mathematical implication about projected hidden-state similarity and expert usage; this is not equivalent to its inputs by construction but is a straightforward consequence of linear algebra. The additional load-balancing result is presented as a proof derived from the properties of the loss function rather than a fit to data or a self-referential citation. No steps match the enumerated circularity patterns: there are no self-definitional equivalences, fitted inputs renamed as predictions, load-bearing self-citations, imported uniqueness theorems, smuggled ansatzes, or renamings of known results. The empirical sections across models are presented as confirmations separate from the theoretical derivations, keeping the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests primarily on the standard architectural fact that MoE routers are linear, plus empirical observations in existing models; no free parameters, invented entities, or ad-hoc axioms beyond this domain assumption are introduced.

axioms (1)
  • domain assumption MoE routers are linear maps
    Invoked as the basis for the necessary-and-sufficient similarity argument in the abstract.

pith-pipeline@v0.9.0 · 5519 in / 1330 out tokens · 49321 ms · 2026-05-10T17:49:31.920188+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EMO: Pretraining Mixture of Experts for Emergent Modularity

    cs.CL 2026-05 conditional novelty 6.0

    EMO uses document-boundary expert pooling during pretraining to induce emergent semantic modularity in MoE models, allowing 25% expert retention with only 1% performance drop.

  2. EMO: Pretraining Mixture of Experts for Emergent Modularity

    cs.CL 2026-05 unverdicted novelty 6.0

    EMO pretrains MoEs using document boundaries to induce semantic expert specialization, enabling modular subset deployment with minimal accuracy loss unlike standard MoEs.

Reference graph

Works this paper leans on

4 extracted references · 1 canonical work page · cited by 1 Pith paper · 1 internal anchor

  1. [1]

  2. [2]

  3. [3]

  4. [4]

     DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models