The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level

Jae Hee Lee; Jeremy Herbst; Stefan Wermter

arxiv: 2604.02178 · v2 · pith:CZKZ5SISnew · submitted 2026-04-02 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-21 09:32 UTC · model grok-4.3

keywords mixture of experts · polysemanticity · interpretability · sparse models · language models · expert specialization · mechanistic interpretability · neural network analysis

The pith

Neurons in Mixture-of-Experts models are less polysemantic than neurons in dense networks, which makes interpretation at the expert level practical.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines whether the built-in sparsity of Mixture-of-Experts architectures makes large language models easier to interpret than traditional dense networks. Through k-sparse probing, it demonstrates that neurons within MoE experts are less polysemantic than those in dense feed-forward layers, and that this advantage grows with increased sparsity in routing. Shifting focus from individual neurons to entire experts as the unit of analysis, the work automatically interprets hundreds of experts to show they specialize in narrow linguistic or semantic operations rather than broad domains or token prediction. A sympathetic reader would care because this suggests a scalable way to understand and potentially control the inner workings of ever-larger models without examining every parameter.

Core claim

The central discovery is that expert neurons in Mixture-of-Experts language models are consistently less polysemantic than neurons in dense feed-forward networks when measured by k-sparse probing, with the gap increasing as the routing becomes sparser. This supports the idea that sparsity encourages monosemantic representations. At the expert level, these components function as fine-grained task experts that handle specific linguistic operations or semantic tasks, such as closing brackets in LaTeX, rather than serving as broad domain specialists or simple token processors.

What carries the argument

k-sparse probing for comparing polysemanticity between expert and dense neurons, together with automated expert interpretation to classify their specializations.
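
To make the probing step concrete, here is a minimal sketch of a k-sparse probe on synthetic activations. It is an illustration under stated assumptions, not the authors' pipeline: ranking neurons by the absolute difference of class means is one common selection heuristic, and the helper names (`select_top_k`, `fit_logistic`, `k_sparse_probe_f1`) and the synthetic data are hypothetical. The idea it shows is that a concept concentrated in one neuron is recoverable at k = 1, while a concept spread thinly over many neurons needs a larger k.

```python
import numpy as np

def select_top_k(acts, labels, k):
    """Rank neurons by |mean(positive) - mean(negative)| and keep the top k.
    Assumption: the paper may use a different selection rule."""
    pos_mean = acts[labels == 1].mean(axis=0)
    neg_mean = acts[labels == 0].mean(axis=0)
    return np.argsort(-np.abs(pos_mean - neg_mean))[:k]

def fit_logistic(x, y, steps=500, lr=0.5):
    """Logistic regression trained with plain gradient descent."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        err = p - y
        w -= lr * (x.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b

def f1_score(y_true, y_pred):
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def k_sparse_probe_f1(train_x, train_y, test_x, test_y, k):
    """Train a probe restricted to k neurons and report held-out F1."""
    idx = select_top_k(train_x, train_y, k)
    mu = train_x[:, idx].mean(axis=0)
    sd = train_x[:, idx].std(axis=0) + 1e-8
    w, b = fit_logistic((train_x[:, idx] - mu) / sd, train_y)
    pred = (((test_x[:, idx] - mu) / sd) @ w + b > 0).astype(int)
    return f1_score(test_y, pred)

# Synthetic activations (hypothetical): one layer stores the concept in a
# single neuron, the other spreads a weak signal across eight neurons.
rng = np.random.default_rng(0)
n, d = 400, 16
y = rng.integers(0, 2, size=n)
mono = rng.normal(size=(n, d))
mono[:, 0] += 4.0 * y
poly = rng.normal(size=(n, d))
poly[:, :8] += 0.7 * y[:, None]

def split_f1(x, k):
    return k_sparse_probe_f1(x[:300], y[:300], x[300:], y[300:], k)

mono_f1 = split_f1(mono, 1)
poly_f1 = split_f1(poly, 1)
print(f"k=1 F1  concentrated: {mono_f1:.2f}  spread: {poly_f1:.2f}")
print(f"k=8 F1  spread: {split_f1(poly, 8):.2f}")
```

A high F1 at small k is read as evidence that few neurons carry the concept; that reading is the proxy assumption the ledger further down flags.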

If this is right

Experts become a practical scale for analyzing and editing model behavior.
Greater sparsity in routing may further reduce polysemanticity and improve interpretability.
Large-scale interpretability efforts can target hundreds of experts instead of millions of neurons.
Model capabilities can be attributed to specific task operations performed by individual experts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Designers of future MoE models might deliberately increase routing sparsity to enhance interpretability.
Techniques for editing or steering models could focus on activating or suppressing particular experts for targeted outputs.
Similar analysis might reveal whether dense models contain hidden substructures that mimic expert-like specialization.

Load-bearing premise

That the k-sparse probing method reliably captures the true degree of polysemanticity independent of the specific datasets used for probing and the chosen sparsity parameter k.

What would settle it

Finding a probe dataset or sparsity level k where dense network neurons appear less polysemantic than MoE expert neurons would challenge the main result.
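
The check described above can be phrased as a small sweep over k. The sketch below is only an illustration: the function name `counterexample_ks` is hypothetical, and the F1 values are made-up placeholders rather than numbers from the paper. It flags every k at which the dense model's probe F1 exceeds the MoE model's by more than a chosen margin.

```python
def counterexample_ks(moe_f1, dense_f1, margin=0.0):
    """Return the k values (present in both sweeps) where the dense probe
    beats the MoE probe by more than `margin`."""
    shared = sorted(set(moe_f1) & set(dense_f1))
    return [k for k in shared if dense_f1[k] - moe_f1[k] > margin]

# Placeholder numbers for illustration only, not results from the paper.
moe = {1: 0.90, 2: 0.92, 4: 0.94, 8: 0.95}
dense = {1: 0.60, 2: 0.75, 4: 0.88, 8: 0.96}

flagged = counterexample_ks(moe, dense)
print("k values where dense beats MoE:", flagged)
```

A nonzero `margin` is a crude stand-in for a significance test; with the reported 95% confidence intervals available, comparing intervals would be the more careful choice.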

Figures

Figures reproduced from arXiv: 2604.02178 by Jae Hee Lee, Jeremy Herbst, Stefan Wermter.

**Figure 1.** Best-layer F1 score for probes trained on MoE and dense models. Models are matched based on active parameter count, and if available from the same model family. Shaded regions represent 95% confidence intervals around the mean estimate over concepts at each k-value. Red lines represent dense models while blue lines represent MoE models. See Figure 8a in Appendix A for additional model comparisons.

**Figure 2.** Comparison of best-layer probes trained on MoE experts against probes trained on dense models. MoE models are on the y-axis and dense models are on the x-axis. Models are matched based on active parameter count, and if available from the same model family. See Figure 8b in Appendix A for additional model comparisons.

**Figure 3.** Comparison of best-layer probes across the OLMo family. Shaded regions represent 95% confidence intervals around the mean estimate over concepts at each k-value.

**Figure 6.** Percentage of prompts for which an expert achieved a high rank or did not get routed. Control prompts show the average rank on prompts designed for other experts.

**Figure 7.** Expert specialization scores across layers for OLMoE-1B-7B. Scores reflect the expert’s deviation from the layer’s aggregate base rate. (Left) Routing Specialization. (Right) Functional Specialization.

**Figure 9.** Estimated number of experts for each concept. For each concept and layer, experts whose F1 probe score is within 95% of the best expert are counted as active. The concept counts are stacked by layer.

**Figure 10.** Text examples for OLMoE-L1-E2. Examples are taken from the data the explainer model saw. Highlighted words are tokens routed to this expert which also received a high score. For OLMoE-L1-E2 the generated label was “Mid-word and terminal suffixes within proper nouns, brands, and technical terms” (F1 score: 0.38).

**Figure 11.** Text examples for OLMoE-L11-E5. Examples are taken from the data the explainer model saw. Highlighted words are tokens routed to this expert which also received a high score. For OLMoE-L11-E5 the generated label was “Activates on specific characters to predict the second half of common abbreviations.” (F1 score: 0.46).

**Figure 12.** Text examples for OLMoE-L15-E10. Examples are taken from the data the explainer model saw. Highlighted words are tokens routed to this expert which also received a high score. For OLMoE-L15-E10 the generated label was “Predicts achievement and overcoming verbs following modal verbs, adverbs, and infinitives.” (F1 score: 0.18).

**Figure 13.** Text examples for Qwen3-L24-E76. Examples are taken from the data the explainer model saw. Highlighted words are tokens routed to this expert which also received a high score. For Qwen3-L24-E76 the generated label was “Syntactic elements and connectors within philosophical, legal, or logical propositions and laws.” (F1 score: 0.30).

**Figure 14.** Text examples for ERNIE-L15-E54. Examples are taken from the data the explainer model saw. Highlighted words are tokens routed to this expert which also received a high score. For ERNIE-L15-E54 the generated label was “Syntactic structures expressing logical explanation, definition, or significance after a demonstrative pronoun.” (F1 score: 0.18).

**Figure 15.** Test case examples from the DLA trigger-target experiment. Trigger words are highlighted in red, while target words are highlighted in blue.

**Figure 16.** Text examples for OLMoE-L15-E17. Highlighted words are tokens routed to this expert which also received a high score. Expert 17 activates broadly in contexts containing dense LaTeX mathematical notation, especially expressions with nested subscripts and superscripts, matrix/vector symbols (e.g., C, k, H), and operators such as O, ∂, or Γ. While many tokens in these regions are routed to the expert (includ…

**Figure 17.** Text examples for OLMoE-L14-E59. Highlighted words are tokens routed to this expert which also received a high score. Expert 59 is routed primarily on structurally common tokens such as conjunctions (e.g., and), prepositions (e.g., of, into), and other high-frequency connective words that occur in descriptive or explanatory passages.

**Figure 18.** Text examples for OLMoE-L9-E60. Highlighted words are tokens routed to this expert which also received a high score. Expert 60 operates primarily in contexts containing proper nouns and transliterated foreign words, including personal names, place names, and organizational names from diverse linguistic regions (e.g., South Asian, African, Middle Eastern, and East Asian contexts). Tokens that are routed to…
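
The counting rule in the Figure 9 caption can be sketched in a few lines. This is an interpretation, not the authors' code: "within 95% of the best expert" is read here as a score of at least 0.95 times the best expert's F1 in that layer, and the concept name and scores are hypothetical placeholders.

```python
def count_active_experts(expert_f1, ratio=0.95):
    """Count experts whose probe F1 is at least `ratio` times the best F1.
    Assumption: this is how 'within 95% of the best expert' is meant."""
    best = max(expert_f1)
    return sum(1 for score in expert_f1 if score >= ratio * best)

def active_counts_by_layer(scores_by_layer, ratio=0.95):
    """Map each layer to its active-expert count for one concept."""
    return {layer: count_active_experts(scores, ratio)
            for layer, scores in scores_by_layer.items()}

# Hypothetical per-expert F1 scores for one concept in two layers.
is_french = {
    0: [0.91, 0.40, 0.88, 0.12],
    1: [0.50, 0.49, 0.30, 0.48],
}

per_layer = active_counts_by_layer(is_french)
stacked_total = sum(per_layer.values())
print(per_layer, "stacked total:", stacked_total)
```

Because the threshold is relative to each layer's best score, a layer with uniformly weak probes can still contribute several "active" experts; an absolute F1 floor would be a stricter alternative.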

Original abstract

Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs). We compare MoE experts and dense FFNs using $k$-sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we zoom out from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token-level processors. Instead, they function as fine-grained task experts, specializing in linguistic operations or semantic tasks (e.g., closing brackets in $\LaTeX{}$). Our findings suggest that MoEs are inherently interpretable at the expert level, providing a clearer path toward large-scale model interpretability. Code is available at: https://github.com/jerryy33/MoE_analysis.
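
For readers unfamiliar with how an MoE layer activates "only a subset of parameters per token", the following is a minimal sketch of top-k routing for a single token. It assumes softmax router scores that are renormalized over the selected experts, which is one common convention; individual models differ in these details, and the function name `route_top_k` and the logits are hypothetical.

```python
import math

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts for one token and return a
    mapping from expert index to renormalized routing weight."""
    peak = max(router_logits)
    exps = [math.exp(v - peak) for v in router_logits]  # stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

# Hypothetical router logits for one token over four experts.
logits = [2.0, 0.5, 1.0, -1.0]
routing = route_top_k(logits, k=2)
active_fraction = len(routing) / len(logits)
print(routing, "fraction of experts active:", active_fraction)
```

Lowering k relative to the number of experts makes routing sparser, which is the axis along which the abstract reports the polysemanticity gap widening.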

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that Mixture-of-Experts (MoE) language models are inherently more interpretable than dense feed-forward networks because expert neurons exhibit lower polysemanticity (as measured by k-sparse probing), with this gap widening under sparser routing. It further claims that experts function as fine-grained task specialists (e.g., linguistic operations such as bracket closing in LaTeX) rather than broad domain or token-level processors, validated through automatic interpretation of hundreds of experts, and concludes that MoEs provide a clearer path to large-scale interpretability.

Significance. If the empirical results hold after addressing methodological controls, the work would offer a practical shift in interpretability analysis from individual neurons to the expert level in MoE models, along with evidence that sparsity induces monosemanticity. The public availability of code at https://github.com/jerryy33/MoE_analysis is a strength that supports reproducibility of the probing and interpretation pipelines.

major comments (2)

[Abstract] Abstract: the central claim that expert neurons are consistently less polysemantic than those in dense FFNs, with the gap widening as routing becomes sparser, rests on k-sparse probing without reported ablations on probe dataset composition, cross-model normalization of the sparsity level k, or sensitivity tests. This leaves open the possibility that the measured gap arises from interactions between native MoE routing and the fixed probe distribution rather than intrinsic monosemanticity.
[Abstract] Abstract (validation step): the automatic interpretation of hundreds of experts to resolve the specialization debate lacks detail on the probe construction, clustering or labeling procedure, and controls for selection bias or inter-expert consistency, making it difficult to assess whether the fine-grained task specialization (e.g., LaTeX bracket closing) is robust or an artifact of the chosen examples.

minor comments (1)

[Abstract] The abstract refers to 'linguistic operations or semantic tasks' without providing concrete quantitative metrics (e.g., activation frequencies or task-specific accuracy) that would allow readers to gauge the granularity of specialization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us improve the clarity and robustness of our work. We address each major comment in detail below, providing additional methodological details and ablations where necessary.

Referee: [Abstract] Abstract: the central claim that expert neurons are consistently less polysemantic than those in dense FFNs, with the gap widening as routing becomes sparser, rests on k-sparse probing without reported ablations on probe dataset composition, cross-model normalization of the sparsity level k, or sensitivity tests. This leaves open the possibility that the measured gap arises from interactions between native MoE routing and the fixed probe distribution rather than intrinsic monosemanticity.

Authors: We acknowledge the importance of these controls to rule out confounds. In the revised manuscript, we include new experiments ablating the probe dataset composition using both in-distribution and out-of-distribution probes. We also normalize k by the average activation sparsity per model and perform sensitivity analysis for k values around the chosen level. These additional results confirm that the polysemanticity gap is robust and not an artifact of the probe distribution or routing interaction. We have added a dedicated paragraph in the methods and results sections describing these ablations.

revision: yes

Referee: [Abstract] Abstract (validation step): the automatic interpretation of hundreds of experts to resolve the specialization debate lacks detail on the probe construction, clustering or labeling procedure, and controls for selection bias or inter-expert consistency, making it difficult to assess whether the fine-grained task specialization (e.g., LaTeX bracket closing) is robust or an artifact of the chosen examples.

Authors: We agree that additional details on the interpretation pipeline are warranted. We have revised the manuscript to provide a comprehensive description of the automatic interpretation method, including: the construction of activation probes from a diverse set of inputs, the use of hierarchical clustering followed by automated labeling with a language model and human verification, and controls such as shuffling expert labels to test for bias and measuring consistency across different data subsets. These additions demonstrate that the observed fine-grained specializations are consistent and not due to selection bias or specific examples. A new appendix details the full procedure and reports quantitative consistency metrics.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparison via external k-sparse probing

The paper reports direct empirical measurements of polysemanticity using k-sparse probing on MoE experts versus dense FFNs, followed by automated interpretation of expert activations. No equations, fitted parameters, or self-citations are shown to reduce the reported specialization gap or task-expert findings to quantities defined by the authors' own inputs. The central claims rest on observed activation patterns over probe data rather than any self-definitional or tautological construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on standard mechanistic-interpretability assumptions about what probing measures and on the validity of automatic expert descriptions; no new physical constants or ad-hoc entities are introduced.

free parameters (1)

sparsity level k
Hyperparameter controlling how many top activations are considered in the probing experiments; chosen by the authors.

axioms (1)

domain assumption: k-sparse probing measures the number of distinct concepts a unit participates in
The method is taken as a proxy for polysemanticity without further justification in the abstract.

pith-pipeline@v0.9.0 · 5766 in / 1357 out tokens · 79672 ms · 2026-05-21T09:32:25.823183+00:00 · methodology
