Expert-Aware Causal Tracing of Factual Recall in Sparse MoE Language Models

Ali Modarressi; Hinrich Schütze; Yihong Liu; Yuetian Lu

arxiv: 2606.03780 · v1 · pith:ZM3BKB2Jnew · submitted 2026-06-02 · 💻 cs.CL · cs.LG

Yuetian Lu, Ali Modarressi, Yihong Liu, Hinrich Schütze

Pith reviewed 2026-06-28 10:17 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords causal tracing · mixture of experts · factual recall · model interpretability · sparse MoE · knowledge localization · Qwen3 · Mixtral

The pith

Expert-aware causal tracing localizes factual recall to specific routed experts in sparse MoE language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method to trace which individual experts in mixture-of-experts models carry factual information during prediction. It corrupts subject tokens with noise to disrupt the fact, then restores either full MoE outputs or individual expert updates to see what recovers the correct answer over a foil. For one model, this pins the effect to a single expert in a particular layer; for another, it requires updates from multiple routed experts together. A sympathetic reader would care because this extends interpretability techniques from dense models to the more modular sparse ones now in wide use, potentially allowing targeted inspection or editing of knowledge in large models.
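The corrupt-then-restore loop described above can be illustrated on a toy single-layer MoE. This is a minimal sketch under loud assumptions, not the paper's implementation: the names `routed_updates`, `contrast`, and `trace` are hypothetical, experts are tiny `tanh` maps, the readout is a single vector standing in for the true-minus-foil unembedding direction, and patching an expert that is not routed under corruption is assumed to add its clean update to the corrupted ones.

```python
import math

def matvec(W, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def routed_updates(h, router, experts, k):
    """Return {expert_id: gate-weighted update} for the top-k routed experts."""
    logits = matvec(router, h)
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    m = max(logits[e] for e in top)
    z = sum(math.exp(logits[e] - m) for e in top)  # softmax over the top-k only
    updates = {}
    for e in top:
        gate = math.exp(logits[e] - m) / z
        updates[e] = [gate * math.tanh(v) for v in matvec(experts[e], h)]
    return updates

def contrast(h, updates, readout):
    """Toy true-minus-foil contrast: readout applied to residual plus MoE updates."""
    resid = list(h)
    for upd in updates.values():
        resid = [r + u for r, u in zip(resid, upd)]
    return sum(r * w for r, w in zip(resid, readout))

def trace(h_clean, noise, router, experts, readout, k=2):
    """Normalized recovery for a full-block patch and for each clean expert patch."""
    h_bad = [a + b for a, b in zip(h_clean, noise)]  # corrupted subject state
    clean = routed_updates(h_clean, router, experts, k)
    bad = routed_updates(h_bad, router, experts, k)
    d_clean = contrast(h_clean, clean, readout)
    d_bad = contrast(h_bad, bad, readout)
    gap = d_clean - d_bad
    if gap == 0:
        raise ValueError("corruption did not change the contrast")

    def recovery(patched):
        return (contrast(h_bad, patched, readout) - d_bad) / gap

    scores = {"block": recovery(clean)}  # restore the whole MoE-block update
    for e, upd in clean.items():
        # Restore one clean expert update; all other updates stay corrupted.
        scores[e] = recovery({**bad, e: upd})
    return scores
```

In this toy, a recovery near 1 for one expert and near 0 for the others would correspond to the Qwen3-style singleton pattern, while a high `"block"` score with low singleton scores would correspond to the Mixtral-style pattern. The real models stack many such layers, so downstream routing effects that this sketch ignores can matter.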

Core claim

The authors formulate expert-aware causal tracing: after subject corruption, they patch in clean expert-level updates and measure what is restored. For Qwen3-30B-A3B-Base this identifies L44E069 as the key expert, whose held-out patch outperforms patches of other active experts at layer 44. For Mixtral-8x7B-v0.1 the signal is not recovered by the selected singleton expert and appears only when multiple routed experts are patched together.

What carries the argument

Expert-aware causal tracing, which tests restoration of true-vs-foil logit contrast by patching individual expert outputs in routed MoE blocks after corrupting subject embeddings.
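One way to write down the quantity being restored, in notation that is an assumption of this summary and not taken from the paper: let \ell_t and \ell_f be the final-position logits of the true and foil tokens, and let (l, e) index a layer and a routed expert. The normalization below is a common convention in causal tracing; whether the paper reports exactly this ratio is not confirmed by the abstract.

```latex
\Delta(\mathrm{run}) = \ell_t(\mathrm{run}) - \ell_f(\mathrm{run}),
\qquad
R(l, e) =
\frac{\Delta(\mathrm{patched}_{l,e}) - \Delta(\mathrm{corrupt})}
     {\Delta(\mathrm{clean}) - \Delta(\mathrm{corrupt})}
```

Under this convention R(l, e) = 1 means the patch fully restores the clean contrast and R(l, e) = 0 means it restores nothing.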

If this is right

Layer 44 is selected and validated for Qwen3 via a sweep, with expert L44E069 showing repeated selection and superior patch performance.
For Mixtral, a mid-layer signal exists, but the selected singleton expert does not recover it; the coalition check recovers it with routed multi-expert updates.
Expert-level localization is model- and protocol-dependent.
MoE factual tracing can be made expert-aware.
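The coalition idea in the list above can be sketched as a search for the smallest set of experts whose joint patch recovers the contrast. This is an illustrative sketch only: `smallest_coalition`, the `recovery_of` callback, and the 0.8 threshold are hypothetical, and the abstract does not say how the paper's coalition check chooses its subsets.

```python
from itertools import combinations

def smallest_coalition(expert_ids, recovery_of, threshold=0.8):
    """Return (subset, score) for the smallest joint patch reaching the threshold.

    recovery_of takes a tuple of expert ids and returns the normalized
    recovery of the true-vs-foil contrast when those experts are patched
    together. Within a size, the highest-scoring subset is chosen.
    """
    for size in range(1, len(expert_ids) + 1):
        scored = [(recovery_of(c), c) for c in combinations(expert_ids, size)]
        best_score, best = max(scored)
        if best_score >= threshold:
            return best, best_score
    return None, None

# Hypothetical numbers: no single expert helps much, but experts 3 and 5
# together recover most of the contrast.
def toy_recovery(subset):
    return 0.9 if {3, 5} <= set(subset) else 0.2 * len(subset)

coalition, score = smallest_coalition([1, 3, 5], toy_recovery)
```

Exhaustive enumeration is only practical when the candidate set is small, for example the handful of experts routed at one token position; larger candidate sets would need a greedy or sampled search instead.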

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Expert tracing could enable more precise knowledge editing in MoE models by targeting only the relevant experts.
This approach might generalize to other sparse architectures beyond the two tested.
Downstream applications could include verifying if experts specialize in particular types of facts.

Load-bearing premise

Restoring clean expert-level updates after subject-token corruption accurately isolates the causal contribution of individual experts rather than reflecting downstream routing or residual effects.

What would settle it

Observing that patching a non-selected expert at the same layer restores the logit contrast as effectively as the identified expert would falsify the localization claim.
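A check of this kind could compare, fact by fact, the identified expert's recovery against the best competing expert at the same layer. The sketch below assumes per-fact recovery scores are already available as dictionaries; the function name `specificity` and all numbers are made up for illustration and are not results from the paper.

```python
from statistics import mean

def specificity(per_fact, selected):
    """Compare the selected expert with the best other expert on each fact.

    per_fact: list of {expert_id: recovery} dicts, one per held-out fact.
    Facts where the selected expert is absent or has no competitor are skipped.
    Returns (win_rate, mean_margin).
    """
    margins = []
    for scores in per_fact:
        others = [v for e, v in scores.items() if e != selected]
        if selected in scores and others:
            margins.append(scores[selected] - max(others))
    if not margins:
        raise ValueError("no fact has both the selected expert and a competitor")
    win_rate = sum(m > 0 for m in margins) / len(margins)
    return win_rate, mean(margins)

# Hypothetical recovery scores for three facts; expert 69 is the candidate.
facts = [
    {69: 0.6, 12: 0.1},
    {69: 0.5, 3: 0.5, 7: 0.2},
    {12: 0.3, 3: 0.1},  # expert 69 not active here, so this fact is skipped
]
win_rate, margin = specificity(facts, selected=69)
```

A win rate near 0.5 with a margin near zero would be the falsifying pattern described above; a high win rate with a clearly positive margin would support the localization claim.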

Figures

Figures reproduced from arXiv: 2606.03780 by Ali Modarressi, Hinrich Schütze, Yihong Liu, Yuetian Lu.

**Figure 1.** Main validation pattern. Left: MoE-block rescue across layers. Middle: selected-expert specificity. Right: …

Original abstract

Causal tracing of factual recall has been studied predominantly in dense transformer language models, where interventions localize information flow to layers or feed-forward modules. Sparse mixture-of-experts (MoE) language models introduce a sharper question: when a factual prediction is mediated by a routed MoE block, which routed expert contributions matter? We formulate expert-aware causal tracing for sparse MoE language models. Using CounterFact facts, we first corrupt the model's factual preference by adding noise to subject-token embeddings, and then test whether clean MoE-block outputs or clean expert-level updates restore the true-vs-foil logit contrast. For Qwen3-30B-A3B-Base, a layer sweep selects and validates layer 44, and expert-level tracing identifies L44E069 as an expert repeatedly selected in the clean run whose held-out patch outperforms other active same-layer expert patches. For Mixtral-8x7B-v0.1, layer-level tracing validates a mid-layer signal, but the signal is not localized to the selected singleton expert; a coalition check instead recovers it with routed multi-expert updates. These results suggest that MoE factual tracing can be made expert-aware, while also showing that expert-level localization is model- and protocol-dependent rather than universal.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adapts causal tracing to routed experts in MoE models and shows model-dependent localization, but routing changes from the corruption step may confound the expert-level claims.

The main point is that they extend causal tracing to the expert level inside MoE blocks. They corrupt subject-token embeddings, then restore either full MoE-block outputs or individual expert updates and measure recovery of the true-vs-foil logit contrast. On Qwen3-30B-A3B-Base they identify one expert at layer 44 whose held-out patch outperforms other same-layer experts; on Mixtral-8x7B-v0.1 the signal only recovers with multi-expert updates. That difference is the concrete new observation.

The method itself is a straightforward extension of prior dense-model work and uses the standard CounterFact setup. Reporting that singleton localization is not universal across models is useful information for anyone working on MoE interpretability.

The soft spot is the one flagged in the stress test. Corrupting the subject changes router logits at every subsequent layer, so patching a single expert that was not selected under corruption could restore the logit contrast by fixing the routing decision rather than by restoring the expert's stored fact. The Mixtral coalition result already shows singleton localization is fragile; the same mechanism could explain the Qwen3 finding without proving expert-specific causal storage. The abstract gives no quantitative statistics, error bars, or explicit controls for downstream routing effects, so the strength of the localization claims is hard to judge.
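A simple diagnostic for the routing confound is to measure how much the routed expert set changes between the clean and corrupted runs. The sketch below assumes per-layer router logits can be read out for both runs; `top_k` and `routing_overlap` are hypothetical helper names, and the logits shown are invented.

```python
def top_k(logits, k):
    """Indices of the k largest router logits."""
    order = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)
    return set(order[:k])

def routing_overlap(clean_logits, corrupt_logits, k):
    """Jaccard overlap of routed sets, plus experts dropped under corruption."""
    clean_set = top_k(clean_logits, k)
    corrupt_set = top_k(corrupt_logits, k)
    jaccard = len(clean_set & corrupt_set) / len(clean_set | corrupt_set)
    dropped = sorted(clean_set - corrupt_set)
    return jaccard, dropped

# Invented router logits for one token at one layer, top-2 routing.
jaccard, dropped = routing_overlap(
    [2.0, 1.0, 0.1, -1.0],  # clean run routes experts 0 and 1
    [0.0, 1.5, 1.2, -1.0],  # corrupted run routes experts 1 and 2
    k=2,
)
```

If the identified expert often appears in `dropped`, a successful patch may be re-inserting an expert the corrupted router no longer selects, which is the routing-fix explanation the letter raises and not evidence of expert-specific storage on its own.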

This is for the mechanistic interpretability subgroup focused on efficient LLMs. It deserves peer review because the question is timely and the protocol is clearly stated, even if the routing confound needs tighter checks in revision.

Referee Report

2 major / 0 minor

Summary. The paper introduces expert-aware causal tracing for sparse MoE language models. It corrupts subject-token embeddings on CounterFact facts to disrupt factual recall, then tests whether restoring clean MoE-block outputs or individual expert updates recovers the true-vs-foil logit contrast. On Qwen3-30B-A3B-Base, layer 44 is selected and expert L44E069 is identified as repeatedly selected in clean runs whose held-out patch outperforms other same-layer experts; on Mixtral-8x7B-v0.1 the signal requires routed multi-expert updates rather than a singleton expert. The results indicate that MoE factual tracing can be made expert-aware but that expert-level localization is model- and protocol-dependent.

Significance. If the localization results hold under tighter controls, the work extends causal-tracing methodology from dense transformers to routed MoE architectures and supplies concrete evidence that factual recall can be isolated to individual experts in at least one model family. The intervention protocol is falsifiable and directly comparable to prior dense-model studies; the model-dependent outcome (singleton vs. coalition) is itself a substantive finding that could guide future editing and interpretability work on MoE systems.

major comments (2)

[Abstract] Abstract and method description: the protocol corrupts subject-token embeddings, which necessarily perturbs router logits at every subsequent layer. Patching only the output of one selected expert (e.g., L44E069) while leaving the corrupted routing decisions in place therefore risks restoring the logit contrast by correcting a routing mismatch or by generic residual-stream effects rather than by restoring expert-specific factual storage. The Mixtral result already shows singleton localization is fragile; the same mechanism could explain the Qwen singleton result without establishing expert-level causal storage.
[Abstract] Abstract: the manuscript reports qualitative localization differences but supplies no quantitative statistics, error bars, success rates across the CounterFact set, or validation details on the number of facts or runs. This absence makes it impossible to judge the reliability or effect size of the claim that L44E069 outperforms other active experts.
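The error bars requested in the second comment could be produced with a percentile bootstrap over per-fact recovery scores. This is a generic sketch of that standard technique, not a description of the paper's statistics; `bootstrap_ci`, the resample count, and the seed are illustrative choices.

```python
import random
from statistics import mean

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-fact scores.

    Resamples the facts with replacement n_boot times and returns
    (sample_mean, lower, upper) at the 1 - alpha level.
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return mean(values), lower, upper
```

Reporting the interval for the identified expert alongside intervals for the other active experts would let a reader judge whether the claimed gap exceeds fact-to-fact variation.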

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the two major comments point by point below, indicating where revisions will be made to improve clarity and rigor.

Referee: [Abstract] Abstract and method description: the protocol corrupts subject-token embeddings, which necessarily perturbs router logits at every subsequent layer. Patching only the output of one selected expert (e.g., L44E069) while leaving the corrupted routing decisions in place therefore risks restoring the logit contrast by correcting a routing mismatch or by generic residual-stream effects rather than by restoring expert-specific factual storage. The Mixtral result already shows singleton localization is fragile; the same mechanism could explain the Qwen singleton result without establishing expert-level causal storage.

Authors: We agree this is an important caveat for causal interpretation. The protocol deliberately patches expert outputs while routing remains driven by the corrupted subject embeddings, mirroring standard causal-tracing practice in dense models. Nevertheless, the concern about routing mismatch or residual effects is valid, particularly given the Mixtral coalition result. In revision we will add explicit controls (e.g., router-logit patching and non-expert residual patching) and expand the discussion of model- and protocol-dependence to make the limitations clearer.
revision: partial

Referee: [Abstract] Abstract: the manuscript reports qualitative localization differences but supplies no quantitative statistics, error bars, success rates across the CounterFact set, or validation details on the number of facts or runs. This absence makes it impossible to judge the reliability or effect size of the claim that L44E069 outperforms other active experts.

Authors: We accept that the abstract is currently qualitative and that quantitative support is needed for assessing reliability. The full manuscript contains experimental details on fact counts and comparisons, but these are not summarized in the abstract. We will revise the abstract to report key quantitative metrics (number of facts, success rates, effect sizes, and error bars from repeated runs) and ensure the main text provides the corresponding statistics and validation details.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical intervention protocol with no derivation chain

The paper describes an experimental protocol for expert-aware causal tracing: corrupt subject-token embeddings, then patch clean MoE block outputs or expert-level updates and measure restoration of logit contrast on CounterFact. This is a direct intervention method with no equations deriving predictions from fitted parameters, no self-definitional loops, and no load-bearing self-citations that reduce the central claim to prior author work. Results are reported as empirical outcomes (e.g., L44E069 outperforming other patches) rather than constructed predictions. The method extends existing causal tracing techniques but remains self-contained against external benchmarks without renaming known results or smuggling ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical intervention study; no free parameters, mathematical axioms, or invented entities are introduced or required by the described protocol.
