Geometric Routing Enables Causal Expert Control in Mixture of Experts
Pith reviewed 2026-05-10 12:49 UTC · model grok-4.3
The pith
Cosine-similarity routing makes individual MoE experts monosemantic and causally controllable.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Individual rank-1 experts are monosemantic by construction, and cosine-similarity routing in a low-dimensional metric space makes their specialization directly inspectable. Projecting expert output vectors through the unembedding matrix yields a Semantic Dictionary in which 15% of experts are monosemantic specialists spanning 10 categories. Routing exhibits a frequency-to-syntax gradient across layers. Causal interventions validate the labels: steering toward a temporal expert centroid increases P(temporal) by +321%, suppressing a geographic expert drops P(geographic) by 23%, and rewriting an expert output vector halves target-category probability, with additive effects across layers. The interventions are not unique to cosine routing: linear routers support comparable steering, but only cosine routing makes specialization readable directly from the centroid matrix.
What carries the argument
Cosine-similarity routing to low-dimensional centroids, which selects experts by vector similarity and allows specialization to be read directly from the centroid matrix.
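This routing step can be sketched in a few lines. The projection `W_down`, the centroid matrix `C`, and `top_k` below are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

def cosine_route(h, W_down, C, top_k=2):
    """Score experts by cosine similarity between a low-dimensional
    projection of the hidden state and each expert's centroid, then
    pick the top-k. Shapes and top_k are illustrative assumptions."""
    z = W_down @ h                                      # project to routing space
    z = z / np.linalg.norm(z)                           # unit-normalize the query
    Cn = C / np.linalg.norm(C, axis=1, keepdims=True)   # unit-normalize centroids
    scores = Cn @ z                                     # cosine similarity per expert
    chosen = np.argsort(scores)[-top_k:][::-1]          # best experts, best first
    return chosen, scores

# Toy setting: 8 experts, 16-dim hidden state, 4-dim routing space.
rng = np.random.default_rng(0)
h = rng.normal(size=16)
W_down = rng.normal(size=(4, 16))
C = rng.normal(size=(8, 4))      # one centroid row per expert
experts, scores = cosine_route(h, W_down, C)
```

The point of the geometric-transparency claim is that the rows of `C` are themselves inspectable: expert identity is read from the centroid matrix, not inferred post hoc.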
If this is right
- 15 percent of experts function as monosemantic specialists in categories including temporal, geographic, cardinal, discourse, emotional, financial, military, and scientific.
- Early layers route tokens primarily by word frequency while deeper layers route by syntactic class, with statistical significance.
- Steering an expert's output toward its centroid increases the probability of its associated semantic category by a median of 321 percent across prompts.
- Suppressing or rewriting an expert's output vector decreases the probability of target categories, and these effects compose additively when applied across multiple layers.
- Linear routers permit similar causal control, yet only the cosine approach enables direct inspection of expert identities from geometry alone.
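The steering and suppression interventions listed above can be sketched as activation additions along an expert's output direction. The scale `alpha` and the exact injection point are assumptions for illustration, not the paper's procedure:

```python
import numpy as np

def steer_expert_output(residual, expert_out_dir, alpha=4.0):
    """Add a scaled copy of an expert's (unit-normalized) output direction
    to the residual stream; suppression uses a negative alpha. The scale
    alpha and the raw-direction choice are illustrative assumptions."""
    d = expert_out_dir / np.linalg.norm(expert_out_dir)
    return residual + alpha * d

rng = np.random.default_rng(1)
resid = rng.normal(size=16)
direction = rng.normal(size=16)
boosted = steer_expert_output(resid, direction, alpha=4.0)      # steer toward
suppressed = steer_expert_output(resid, direction, alpha=-4.0)  # steer away
```

Additive composition across layers would correspond to applying this edit independently at each layer's residual stream.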
Where Pith is reading between the lines
- If the geometric transparency holds, then MoE architectures could be designed from the start to support built-in interpretability rather than requiring post-training analysis.
- The frequency-to-syntax progression in routing might reflect a natural hierarchy in how information is processed in language models.
- Testing whether other sparse activation methods produce similar monosemantic experts when given geometric routing would extend the finding.
- Such direct control suggests applications in safe model deployment where specific expert behaviors can be modulated without retraining.
Load-bearing premise
That projecting expert output vectors through the unembedding matrix yields faithful semantic labels rather than artifacts of the projection or training data distribution.
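The projection in question is a logit-lens readout. A minimal sketch, with a toy unembedding matrix and vocabulary standing in for the real model:

```python
import numpy as np

def semantic_dictionary_entry(expert_out, W_U, vocab, top_k=5):
    """Logit-lens reading of one expert: project its output vector through
    the unembedding matrix W_U and return the top-k tokens it promotes.
    W_U and vocab here are toy stand-ins, not the paper's model."""
    logits = W_U @ expert_out                  # one logit per vocabulary token
    top = np.argsort(logits)[-top_k:][::-1]    # highest-logit tokens first
    return [vocab[i] for i in top]

rng = np.random.default_rng(2)
vocab = [f"tok{i}" for i in range(10)]
W_U = rng.normal(size=(10, 16))                # (vocab size, model dim)
expert_out = rng.normal(size=16)
top_tokens = semantic_dictionary_entry(expert_out, W_U, vocab)
```

The premise is that these top tokens reflect the expert's intrinsic semantics rather than geometry of `W_U` or training-data co-occurrence.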
What would settle it
Finding that causal interventions on the identified experts produce no significant change in the predicted probabilities for the corresponding semantic categories, or that the projected labels do not align with actual token distributions.
Figures
Original abstract
Sparse Mixture-of-Experts (MoE) models scale parameters while fixing active computation per token, but the specialization of individual experts remains opaque. In a companion paper we showed that routing topology is quality-neutral: five structurally different configurations converge to statistically equivalent language modeling quality. Here we show that expert identity is nonetheless causally meaningful: individual rank-1 experts are monosemantic by construction, and cosine-similarity routing in a low-dimensional metric space makes their specialization directly inspectable. We present four lines of evidence. First, projecting expert output vectors through the unembedding matrix yields a Semantic Dictionary: 15% of experts are monosemantic specialists spanning 10 categories (temporal, geographic, cardinal, discourse, emotional, financial, military, scientific). Second, routing exhibits a frequency-to-syntax gradient: early layers separate tokens by word frequency, deeper layers by syntactic class (Zipf-confound controls, all $p < 0.001$). Third, causal interventions confirm these labels: steering toward a temporal expert's centroid increases P(temporal) by +321% (median across 44 prompts); suppressing a geographic expert drops P(geographic) by -23%; rewriting an expert's output vector halves target-category probability, and effects compose additively across layers. Fourth, the interventions are not unique to cosine routing: linear routers support comparable steering, but only cosine routing provides geometric transparency -- expert specialization is readable directly from the centroid matrix. MoE expert-level specialization is a first-class interpretability primitive: architecturally monosemantic, causally validated, and controllable at inference with zero overhead.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that sparse Mixture-of-Experts (MoE) models possess causally meaningful expert identities, with rank-1 experts being monosemantic by construction. Cosine-similarity routing in a low-dimensional metric space renders specialization directly inspectable via a Semantic Dictionary obtained by projecting expert output vectors through the unembedding matrix (identifying 15% of experts as specialists across 10 categories such as temporal and geographic). Routing exhibits a frequency-to-syntax gradient (early layers by token frequency, deeper by syntactic class, with Zipf-confound controls at p<0.001). Causal interventions (centroid steering, suppression, vector rewriting) validate the labels with large effects (+321% median increase in P(temporal) across 44 prompts; -23% drop in P(geographic); additive composition across layers) and show that linear routers support comparable steering while cosine routing uniquely provides geometric transparency.
Significance. If the central claims hold after addressing validation gaps, this work would establish expert-level specialization as a first-class, architecturally grounded interpretability primitive in MoE models, enabling zero-overhead causal control at inference and direct geometric inspection of specialization. The large, statistically significant intervention deltas, additive effects, and contrast with linear routers provide concrete, falsifiable evidence for controllability that builds directly on the companion paper's topology-neutrality result; this could shift MoE analysis from opaque scaling to explicit expert manipulation.
major comments (3)
- [Abstract] Abstract (Semantic Dictionary paragraph): The monosemanticity claim and all downstream causal interventions rest on the unembedding projection yielding faithful category labels, yet no independent verification (activation patching on held-out features, alternative projection methods, or probe-based validation) is described to distinguish intrinsic expert semantics from artifacts of the unembedding matrix, residual stream statistics, or training co-occurrences. This is load-bearing because the reported P(category) measurements and intervention targets are derived from the same labeling procedure.
- [Abstract] Abstract (causal interventions paragraph): The median effects (+321% for temporal steering, -23% for geographic suppression) are presented without error bars, without details on the 44 prompts (selection criteria, diversity, or stratification), and without an ablation of the low-dimensional projection step. These omissions prevent assessment of whether the p<0.001 significance and large deltas are robust or sensitive to prompt choice and projection artifacts.
- [Routing gradient] Routing gradient description: The frequency-to-syntax gradient is supported by Zipf-confound controls, but the manuscript provides no explicit description of how prompt selection or token sampling enforces the confound controls, nor the precise statistical test used to establish layer-wise separation (p<0.001). This detail is required to confirm the gradient is not an artifact of residual frequency correlations.
minor comments (2)
- [Abstract] The companion paper on topology neutrality is referenced without a full citation or arXiv identifier in the text.
- [General] Notation for the low-dimensional metric space, centroid matrix, and P(category) probability should be introduced with a brief formal definition or equation in the main text for clarity.
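On the second minor comment: a common operationalization of P(category), sketched here under the assumption (not confirmed by the manuscript) that it sums next-token probability over a fixed category token set:

```python
import numpy as np

def p_category(logits, category_token_ids):
    """Sum next-token probability mass over a fixed category token set.
    That the paper defines P(category) exactly this way is an assumption;
    this is only the common operationalization."""
    z = logits - logits.max()                  # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    return float(p[category_token_ids].sum())

logits = np.array([2.0, 1.0, 0.5, -1.0])
p_temporal = p_category(logits, [0, 1])        # mass on hypothetical "temporal" tokens
```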
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback. We agree that additional validation and details are needed to strengthen the manuscript, particularly regarding the robustness of the Semantic Dictionary and the causal intervention results. Below we address each major comment point by point, indicating where revisions will be made.
Point-by-point responses
- Referee: [Abstract] Abstract (Semantic Dictionary paragraph): The monosemanticity claim and all downstream causal interventions rest on the unembedding projection yielding faithful category labels, yet no independent verification (activation patching on held-out features, alternative projection methods, or probe-based validation) is described to distinguish intrinsic expert semantics from artifacts of the unembedding matrix, residual stream statistics, or training co-occurrences. This is load-bearing because the reported P(category) measurements and intervention targets are derived from the same labeling procedure.
  Authors: We agree that independent verification of the Semantic Dictionary labels would strengthen the claims. The unembedding projection is motivated by the logit lens approach commonly used in the interpretability literature to read out semantic information from residual stream activations. To address potential artifacts, the revised manuscript will include an ablation study using linear probes trained on held-out data to validate the category assignments independently of the unembedding matrix, and will report agreement rates between the projection-based labels and the probe predictions. This will help distinguish intrinsic semantics from projection artifacts. revision: yes
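The agreement-rate check the authors propose could be as simple as the following sketch (the label names and encoding are illustrative):

```python
import numpy as np

def label_agreement(projection_labels, probe_labels):
    """Fraction of experts where the projection-derived label matches an
    independent probe's prediction. Label names are illustrative."""
    a = np.asarray(projection_labels)
    b = np.asarray(probe_labels)
    return float((a == b).mean())

proj = ["temporal", "geographic", "temporal", "cardinal"]
probe = ["temporal", "geographic", "discourse", "cardinal"]
rate = label_agreement(proj, probe)            # 3 of 4 labels agree
```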
- Referee: [Abstract] Abstract (causal interventions paragraph): The median effects (+321% for temporal steering, -23% for geographic suppression) are presented without error bars, without details on the 44 prompts (selection criteria, diversity, or stratification), and without an ablation of the low-dimensional projection step. These omissions prevent assessment of whether the p<0.001 significance and large deltas are robust or sensitive to prompt choice and projection artifacts.
  Authors: We acknowledge the need for more transparency in reporting the intervention results. In the revision, we will add error bars (e.g., interquartile range across prompts) to the median effects. We will provide details on the 44 prompts, including their selection criteria, diversity, and stratification by category. Additionally, we will include an ablation of the low-dimensional projection step to assess its impact on the results. We will also explicitly describe the statistical test used to obtain the p<0.001 significance and include the full distribution of effects. revision: yes
- Referee: [Routing gradient] Routing gradient description: The frequency-to-syntax gradient is supported by Zipf-confound controls, but the manuscript provides no explicit description of how prompt selection or token sampling enforces the confound controls, nor the precise statistical test used to establish layer-wise separation (p<0.001). This detail is required to confirm the gradient is not an artifact of residual frequency correlations.
  Authors: We will expand the manuscript to provide an explicit description of how prompt selection and token sampling enforce the Zipf-confound controls. We will also specify the precise statistical test used to establish the layer-wise separation (p<0.001). This will allow readers to confirm that the gradient is not an artifact of residual frequency correlations. revision: yes
Circularity Check
No significant circularity; derivation relies on independent causal tests
Full rationale
The paper's central claims rest on four lines of evidence: semantic dictionary from unembedding projection, frequency-to-syntax routing gradient with statistical controls, causal interventions (steering, suppression, rewriting) that measure changes in category probabilities, and comparison to linear routers. The companion paper citation establishes only that topology is quality-neutral as background context; the present claims about monosemanticity and inspectability do not reduce to that citation or to any fitted parameter by construction. No equation or definition equates a reported effect to its own input, and the causal interventions provide an external check on the projection-derived labels rather than assuming them tautologically. The work is self-contained against its stated benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Cosine similarity in the chosen low-dimensional space preserves semantic distinctions relevant to token routing.
- domain assumption The unembedding matrix maps expert outputs to human-interpretable token distributions without introducing spurious category alignments.
invented entities (1)
- Semantic Dictionary: no independent evidence
Reference graph
Works this paper leans on
- [1] Ivan Ternovtsii and Yurii Bilak. Equifinality in mixture of experts: Routing topology does not determine language modeling quality. arXiv preprint, 2026.
- [2] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
- [3] Damai Dai, Chengqi Deng, Chenggang Zhao, R.X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, et al. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066, 2024.
- [4] Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, et al. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023.
- [5] nostalgebraist. interpreting GPT: the logit lens. LessWrong, 2020.
- [6] Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Goldberg. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022.
- [7] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. Transformer Circuits Thread, 2022.
- [8] Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, 2019.
- [9] Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. arXiv preprint arXiv:1905.05950, 2019.
- [10] Guoliang Zhao, Yuhan Fu, Shuaipeng Li, et al. Towards a comprehensive scaling law of mixture-of-experts. arXiv preprint arXiv:2509.23678, 2025.
- [11] Changxin Tian, Kunlong Chen, Jia Liu, Ziqi Liu, Zhiqiang Zhang, and Jun Zhou. Towards greater leverage: Scaling laws for efficient mixture-of-experts language models. arXiv preprint arXiv:2507.17702, 2025.
- [12] Patrick Leask, Joshua Mendel, Stepan Boettiger, Nikhil Mulligan, et al. Sparse autoencoders do not find canonical units of analysis. In Proceedings of the International Conference on Learning Representations, 2025.
- [13] David Chanin, James Wilken-Smith, Tomáš Dulka, Hardik Bhatnagar, and Joseph Bloom. A is for absorption: Studying feature splitting and absorption in sparse autoencoders. arXiv preprint arXiv:2409.14507, 2024.
- [14] Joshua Engels, Isaac Liao, and Max Tegmark. Decomposing the dark matter of sparse autoencoders. arXiv preprint arXiv:2410.14670, 2024.
- [15] Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. In Advances in Neural Information Processing Systems, volume 36, 2024.
- [16] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023.
- [17] Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte Pelrine. Activation addition: Steering language models without optimization. arXiv preprint arXiv:2308.10248, 2023.
- [18] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems, volume 35, 2022.
- [19] Mohsen Fayyaz et al. SteerMoE: Steering mixture-of-experts LLMs via expert (de)activation. arXiv preprint arXiv:2509.09660, 2025.
- [20] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
- [21] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.
- [22] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations, 2021.
- [23] Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L. Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, et al. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread, 2024.
- [24] Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093, 2024.
- [25] Tianyi Chen et al. Sparsity and superposition in mixture of experts. arXiv preprint arXiv:2510.23671, 2025.
- [26] Jungwoo Park, Young Jin Ahn, Kee-Eung Kim, and Jaewoo Kang. MONET: Mixture of monosemantic experts for transformers. In Proceedings of the International Conference on Learning Representations, 2025.
- [27] Joseph Bloom and Curt Tigges Lin. Understanding SAE features with the logit lens. LessWrong / Alignment Forum, 2024.
- [28] Ido Arad, Aaron Mueller, and Yonatan Belinkov. SAEs are good for steering – if you select the right features. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025.
- [29] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
- [30] Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, volume 30, 2017.
- [31] Ronny Luss, Erik Miehling, and Amit Dhurandhar. CELL your model: Contrastive explanations for large language models. arXiv preprint arXiv:2406.11785, 2024.
- [32] Zeinab Dehghani, Mohammed Naveed Akram, Koorosh Aslansefat, and Adil Khan. Explaining large language models with gSMILE. arXiv preprint arXiv:2505.21657, 2025.
- [33] Yan Wang, Yitao Xu, Nanhan Shen, Jinyan Su, Jimin Huang, and Zining Zhu. The illusion of specialization: Unveiling the domain-invariant "standing committee" in mixture-of-experts models. arXiv preprint arXiv:2601.03425, 2026.