arxiv: 2605.06076 · v1 · submitted 2026-05-07 · 💻 cs.CL


Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training

Hang Chen, Hongxu Liu, Hongyang Chen, Jiaying Zhu, Wenya Wang, Xinyu Yang

Pith reviewed 2026-05-08 10:39 UTC · model grok-4.3

classification 💻 cs.CL

keywords mechanistic interpretability · LLM fine-tuning · circuit evolution · locate-then-update · Transformer circuits · temporal latency · post-training dynamics


The pith

Static mechanisms extracted from current LLM parameters suffer temporal latency and cannot reliably guide future updates during fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests the locate-then-update approach that dominates LLM post-training by tracking how circuits in Transformer models actually change across supervised fine-tuning steps. It introduces three metrics to measure neural migration, semantic stability, and cross-task interference, showing that circuits undergo free evolution rather than staying fixed. Because of this movement, any mechanism identified at one moment becomes outdated for later parameter edits. Readers care because the common practice of finding critical components once and then updating them rests on an unexamined assumption that those components remain stable.

Core claim

By systematically tracking the structural evolution of Transformer circuits throughout supervised fine-tuning, the authors reveal that circuits exhibit free evolution. This leads to the conclusion that static mechanisms extracted from current states inevitably suffer from temporal latency and are fundamentally inadequate for guiding future states.

What carries the argument

Three new metrics—Circuit Distance, Circuit Stability, and Circuit Conflict—that quantify neural migration, semantic stability, and cross-task interference to reveal the free evolution of circuits during parameter updates.
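
This summary does not reproduce the paper's definitions of the three metrics. As a purely illustrative stand-in (an assumption, not the authors' Circuit Distance), drift between two checkpoints can be pictured as the Jaccard distance between the sets of components that localization returns at each checkpoint. The function name and the component labels below are hypothetical.

```python
from typing import Hashable, Iterable


def jaccard_distance(circuit_a: Iterable[Hashable], circuit_b: Iterable[Hashable]) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B| for two sets of circuit components.

    Illustrative stand-in only; NOT the paper's Circuit Distance.
    """
    a, b = set(circuit_a), set(circuit_b)
    union = a | b
    if not union:
        return 0.0  # two empty circuits are treated as identical
    return 1.0 - len(a & b) / len(union)


# Hypothetical components located at two fine-tuning checkpoints.
step_0 = {"L3.H1", "L5.H7", "L9.MLP"}
step_500 = {"L5.H7", "L9.MLP", "L11.H2", "L12.MLP"}

drift = jaccard_distance(step_0, step_500)
print(f"drift = {drift:.2f}")  # 2 shared of 5 total components -> drift = 0.60
```

A set-overlap measure like this only captures which components are selected; it says nothing about whether their function changed, which is the kind of gap the referee report raises about validating drift metrics causally.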

If this is right

Locate-then-update methods lose effectiveness as training progresses because circuits drift from their initial locations.
The apparent success of existing mechanistic localization techniques partly reflects an illusion created by short evaluation windows that do not capture ongoing evolution.
Mechanistic interventions in LLMs will require predictive models of circuit change rather than one-time snapshots from the current state.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers may need to interleave fresh localization steps with training rather than performing it only at the beginning.
This pattern could explain why some editing techniques degrade on longer fine-tunes even when they succeed on short ones.
A natural extension would be to test whether forecasting circuit trajectories in advance improves the precision of targeted updates.

Load-bearing premise

The new metrics accurately capture the mechanistically relevant changes in circuits, and the observed free evolution generalizes beyond the specific models and tasks examined.

What would settle it

Repeating the localization step at multiple points during fine-tuning and finding no improvement in edit success or task performance compared with using the initial static localization.
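
The shape of such a comparison can be sketched on a toy model. The simulation below is an assumption-heavy illustration, not the paper's experiment: component importance is modeled as an independent Gaussian random walk, "localization" is just a top-k selection, and all names (`top_k`, `simulate`, `refresh_every`) are hypothetical. It contrasts a mask located once at step 0 with a mask that is re-located on a fixed schedule.

```python
import random
from statistics import mean
from typing import List, Optional, Set


def top_k(scores: List[float], k: int) -> Set[int]:
    """Indices of the k largest scores (the 'located' components)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def simulate(steps: int, refresh_every: Optional[int], n: int = 200, k: int = 20,
             noise: float = 0.1, seed: int = 0) -> List[float]:
    """Overlap between the mask in use and the current top-k at every step.

    Toy assumption: importance scores follow a Gaussian random walk.
    refresh_every=None keeps the mask located at step 0 (static localization).
    """
    rng = random.Random(seed)
    scores = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mask = top_k(scores, k)
    overlaps = []
    for step in range(1, steps + 1):
        # One "training step": every importance score drifts a little.
        scores = [s + rng.gauss(0.0, noise) for s in scores]
        if refresh_every is not None and step % refresh_every == 0:
            mask = top_k(scores, k)  # re-run localization on the current state
        overlaps.append(len(mask & top_k(scores, k)) / k)
    return overlaps


static = simulate(steps=300, refresh_every=None)
refreshed = simulate(steps=300, refresh_every=25)
print(f"static    mean overlap: {mean(static):.2f}")
print(f"refreshed mean overlap: {mean(refreshed):.2f}")
```

Both runs share a seed, so they see the same score trajectory and differ only in the refresh schedule. The numbers say nothing about real models; a real test would replace the random walk with actual fine-tuning checkpoints and the overlap score with edit success or task accuracy.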

Figures

Figures reproduced from arXiv: 2605.06076 by Hang Chen, Hongxu Liu, Hongyang Chen, Jiaying Zhu, Wenya Wang, Xinyu Yang.

**Figure 1.** Differences in mechanism localization in post

**Figure 2.** Line plots of different target tasks on the Mistral-7B model in terms of Circuit Distance

**Figure 3.** Target Task Accuracy, Pervasiveness Task Accuracy, and Circuit Conflict of Arithmetic

**Figure 4.** Line plots of Future-Localization

**Figure 5.** Line plots of different target tasks on the Mistral-7B model and LLaMA3-8B model in terms

**Figure 6.** Target Task Accuracy (T-Acc), Pervasiveness Task Accuracy (P-Acc), and Circuit Conflict

**Figure 7.** Circuit Distance (CD) and Circuit Stability (CS) of Arithmetic, Bool, Gender, Winogrande, and SST-2 tasks with localization.

**Figure 8.** Line plots of Future Mechanistic Localization

Original abstract

The "Locate-then-Update" paradigm has become a predominant approach in the post-training of large language models (LLMs), identifying critical components via mechanistic interpretability for targeted parameter updates. However, this paradigm rests on a fundamental yet unverified assumption: can mechanisms derived from current static parameters reliably guide future dynamic parameter updates? To investigate this, we systematically track the structural evolution of Transformer circuits throughout the supervised fine-tuning (SFT) process, revealing the underlying dynamics of task mechanisms. We introduce three novel metrics (Circuit Distance, Circuit Stability, and Circuit Conflict) to analyze circuit evolution across three dimensions: neural migration, semantic stability, and cross-task interference. Our empirical results reveal that circuits inherently exhibit "Free Evolution" during parameter updates. Consequently, static mechanisms extracted from current states inevitably suffer from temporal latency, making them fundamentally inadequate for guiding future states. Moreover, by deconstructing the "illusion of effectiveness" in existing methods, this work underscores the necessity of "foresight" in mechanistic localization and proposes a predictive framework for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper tracks circuit changes during fine-tuning with three new metrics and argues static localization is too slow, but the metrics lack causal checks so the strong claim does not yet land.


The core observation is that circuits shift during supervised fine-tuning, and the authors supply three metrics to quantify neural migration, semantic stability, and task interference. They label the shifts free evolution and conclude that any mechanism found at one training step will be outdated for the next, so locate-then-update methods carry built-in latency. That framing is the main takeaway a colleague should carry away.

The empirical tracking itself is the clearest new piece. They follow the same circuits across multiple SFT steps on concrete tasks and report consistent structural drift, which prior work on static circuit discovery had not measured directly. The metrics give a practical way to talk about how much a circuit has moved, and the cross-task conflict angle is a useful addition for anyone thinking about multi-task editing.

The limitation is that the metrics are introduced without head-to-head comparison to activation patching or causal tracing. It is possible the distance and stability scores largely reflect parameter noise rather than the components that actually control output behavior. If that is the case, the numerical evolution does not automatically prove that current-state circuits fail to guide future updates. The paper shows the numbers move; it does not yet show that the movement breaks editing performance in a controlled test.

Readers already working on mechanistic localization or model editing will find the metrics worth trying on their own runs. Someone outside that niche will mostly see another reminder that circuits are not fixed. The work is coherent on its own terms and engages the right literature, so it clears the bar for peer review. A referee could usefully press for causal validation of the metrics and a direct editing experiment that measures whether using an earlier circuit actually hurts downstream performance. I would send it out.

Referee Report

2 major / 2 minor

Summary. The paper argues that the dominant 'Locate-then-Update' paradigm in LLM post-training is fundamentally limited because static mechanistic localizations extracted from current model states cannot reliably guide future parameter updates. It supports this by tracking Transformer circuit evolution during supervised fine-tuning (SFT), introducing three new metrics (Circuit Distance for neural migration, Circuit Stability for semantic consistency, and Circuit Conflict for cross-task interference) that reveal 'Free Evolution'—ongoing structural drift independent of the localization process. The work deconstructs the apparent success of existing methods as illusory and calls for predictive, foresight-based localization frameworks.

Significance. If the empirical observations hold, the paper identifies a previously under-examined temporal mismatch in mechanistic interpretability for post-training, with potential to redirect research from static circuit discovery toward dynamic or predictive approaches. The explicit tracking of circuit changes across training steps and the introduction of evolution-specific metrics constitute a concrete empirical contribution that could be built upon, provided the metrics are shown to track causally relevant mechanisms rather than incidental parameter shifts.

major comments (2)

[Abstract / metrics introduction] Abstract and metrics section: The central claim that static mechanisms 'inevitably suffer from temporal latency' and are 'fundamentally inadequate' rests on the three new metrics demonstrating 'Free Evolution.' However, the manuscript provides no ablation or comparison against established causal tools (activation patching, causal tracing, or path patching) to establish that Circuit Distance, Stability, and Conflict track task-relevant mechanisms rather than non-causal parameter noise. Without such validation, the inference from observed numerical drift to fundamental inadequacy of the locate-then-update paradigm does not follow.
[Empirical results] Empirical results section: The reported structural drift is presented as generalizing beyond the specific models and tasks studied, yet the manuscript does not include controls for whether the observed evolution correlates with downstream task performance degradation or with changes in causal importance of the localized components. This is load-bearing for the 'free evolution' generalization.

minor comments (2)

[Introduction] The abstract and introduction would benefit from explicit citation of prior work on circuit evolution or dynamic interpretability (e.g., studies tracking attention heads or MLP circuits across training checkpoints) to better situate the novelty of the proposed metrics.
[Methods] Notation for the three metrics should be formalized with equations or pseudocode in the methods section to allow reproducibility; currently the descriptions remain high-level.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which identifies key areas for strengthening the validation of our metrics and the generalization of our findings. We address each major comment below and will incorporate revisions to enhance the manuscript.


Referee: Abstract / metrics introduction: The central claim that static mechanisms 'inevitably suffer from temporal latency' and are 'fundamentally inadequate' rests on the three new metrics demonstrating 'Free Evolution.' However, the manuscript provides no ablation or comparison against established causal tools (activation patching, causal tracing, or path patching) to establish that Circuit Distance, Stability, and Conflict track task-relevant mechanisms rather than non-causal parameter noise. Without such validation, the inference from observed numerical drift to fundamental inadequacy of the locate-then-update paradigm does not follow.

Authors: We agree that direct validation against causal intervention methods is necessary to confirm our metrics track task-relevant mechanisms. The metrics were constructed to measure structural properties of circuits localized via standard interpretability techniques, with free evolution observed as consistent drift across training steps independent of the localization process. To address the concern, we will revise the paper by adding ablation experiments that compare metric values against activation patching and path patching outcomes, verifying alignment between numerical drift and changes in causal effects. These will be included in an expanded metrics validation subsection.
Revision: yes

Referee: Empirical results: The reported structural drift is presented as generalizing beyond the specific models and tasks studied, yet the manuscript does not include controls for whether the observed evolution correlates with downstream task performance degradation or with changes in causal importance of the localized components. This is load-bearing for the 'free evolution' generalization.

Authors: We acknowledge that explicit controls correlating drift with performance degradation and causal importance would bolster the generalization claim. Our current results demonstrate consistent free evolution across multiple models and tasks, supporting the temporal latency issue. In revision, we will add analyses correlating Circuit Distance and Stability with performance drops when applying static localizations to later training states, plus causal importance checks via patching to show that evolved components retain relevance. This will be integrated into the empirical results section to directly address the load-bearing requirement.
Revision: yes

Circularity Check

0 steps flagged

No circularity: central claim follows from direct empirical tracking of circuit evolution


The paper's argument rests on systematic empirical tracking of Transformer circuit structural changes across SFT steps, using three newly introduced metrics (Circuit Distance, Circuit Stability, Circuit Conflict) to document 'Free Evolution'. The conclusion that static mechanisms suffer temporal latency is presented as a direct inference from these observations rather than any derivation, fitted parameter, or self-referential definition. No equations reduce a 'prediction' to an input by construction, and the provided text contains no load-bearing self-citations or uniqueness theorems imported from prior author work. The analysis is self-contained against external benchmarks via its empirical methodology.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim depends on the validity of the three new metrics as faithful measures of mechanistic change and on the interpretation that observed circuit shifts constitute general free evolution independent of specific update targets.

axioms (2)

domain assumption: Mechanistic interpretability techniques can reliably identify and track task-relevant circuits across training steps.
The locate-then-update paradigm and the tracking analysis both presuppose that circuits remain identifiable and meaningful entities during SFT.
ad hoc to paper: The proposed metrics (Circuit Distance, Stability, Conflict) capture the dimensions relevant to localization effectiveness.
These are novel metrics introduced in the paper without reference to prior validation or theoretical grounding.

invented entities (1)

Free Evolution (no independent evidence)
purpose: To characterize the inherent, ongoing structural change in circuits during parameter updates
New descriptive term introduced to summarize the observed dynamics; no independent falsifiable prediction is provided.



Reference graph

Works this paper leans on

63 extracted references · 9 canonical work pages · 3 internal anchors

[1]

A survey of post-training scaling in large language models

Hanyu Lai, Xiao Liu, Junjie Gao, Jiale Cheng, Zehan Qi, Yifan Xu, Shuntian Yao, Dan Zhang, Jinhua Du, Zhenyu Hou, et al. A survey of post-training scaling in large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2771–2791, 2025

2025
[2]

Smoothquant: Accurate and efficient post-training quantization for large language models

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. InInternational conference on machine learning, pages 38087–38099. PMLR, 2023

2023
[3]

LoRA: Low-rank adaptation of large language models

Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

2022
[4]

Teaching large language models to reason with reinforcement learning

Alexander Havrilla, Yuqing Du, Sharath Chandra Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Eric Hambro, Sainbayar Sukhbaatar, and Roberta Raileanu. Teaching large language models to reason with reinforcement learning. InAI for Math Workshop @ ICML 2024, 2024

2024
[5]

Editing large language models: Problems, methods, and opportunities

Yunzhi Yao, Peng Wang, Bozhong Tian, Siyuan Cheng, Zhoubo Li, Shumin Deng, Huajun Chen, and Ningyu Zhang. Editing large language models: Problems, methods, and opportunities. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10222–10240, 2023

2023
[6]

Yuanpu Cao, Tianrong Zhang, Bochuan Cao, Ziyi Yin, Lu Lin, Fenglong Ma, and Jinghui Chen. Personalized steering of large language models: Versatile steering vectors through bi-directional preference optimization.Advances in Neural Information Processing Systems, 37:49519–49551, 2024

2024
[7]

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

Hengyuan Zhang, Zhihao Zhang, Mingyang Wang, Zunhai Su, Yiwei Wang, Qianli Wang, Shuzhou Yuan, Ercong Nie, Xufeng Duan, Qibo Xue, et al. Locate, steer, and improve: A practical survey of actionable mechanistic interpretability in large language models.arXiv preprint arXiv:2601.14004, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[8]

Depn: Detecting and editing privacy neurons in pretrained language models

Xinwei Wu, Junzhuo Li, Minghui Xu, Weilong Dong, Shuangzhi Wu, Chao Bian, and Deyi Xiong. Depn: Detecting and editing privacy neurons in pretrained language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2875–2886, 2023

2023
[9]

Effective skill unlearning through intervention and abstention

Yongce Li, Chung-En Sun, and Tsui-Wei Weng. Effective skill unlearning through intervention and abstention. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6358–6371, 2025

2025
[10]

Locating and editing factual associations in gpt.Advances in neural information processing systems, 35:17359–17372, 2022

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt.Advances in neural information processing systems, 35:17359–17372, 2022

2022
[11]

Knowledge neurons in pretrained transformers

Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge neurons in pretrained transformers. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8493–8502, 2022

2022
[12]

Spurious rewards paradox: Mechanistically understanding how rlvr activates memorization shortcuts in llms.arXiv preprint arXiv:2601.11061, 2026

Lecheng Yan, Ruizhe Li, Guanhua Chen, Qing Li, Jiahui Geng, Wenxi Li, Vincent Wang, and Chris Lee. Spurious rewards paradox: Mechanistically understanding how rlvr activates memorization shortcuts in llms.arXiv preprint arXiv:2601.11061, 2026

work page arXiv 2026
[13]

Rethinking circuit completeness in language models: And, or, and adder gates

Hang Chen, Jiaying Zhu, Xinyu Yang, and Wenya Wang. Rethinking circuit completeness in language models: And, or, and adder gates. In D. Belgrave, C. Zhang, H. Lin, L. Montoya, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors,Advances in Neural Information Processing Systems, volume 38, pages 150511–150540. Curran Associates, Inc., 2025

2025
[14]

Does localization inform editing? surprising differences in causality-based localization vs

Peter Hase, Mohit Bansal, Been Kim, and Asma Ghandeharioun. Does localization inform editing? surprising differences in causality-based localization vs. knowledge editing in language models.Advances in Neural Information Processing Systems, 36:17643–17668, 2023. 11

2023
[15]

Towards automated circuit discovery for mechanistic interpretability.Advances in Neural Information Processing Systems, 36:16318–16352, 2023

Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability.Advances in Neural Information Processing Systems, 36:16318–16352, 2023

2023
[16]

Attribution patching outperforms automated circuit discovery

Aaquib Syed, Can Rager, and Arthur Conmy. Attribution patching outperforms automated circuit discovery. InProceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 407–416, 2024

2024
[17]

Rethinking machine unlearning for large language models.Nature Machine Intelligence, 7(2):181–194, 2025

Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Yuguang Yao, Chris Yuhao Liu, Xiaojun Xu, Hang Li, et al. Rethinking machine unlearning for large language models.Nature Machine Intelligence, 7(2):181–194, 2025

2025
[18]

Knowledge editing for large language models: A survey.ACM Computing Surveys, 57(3):1–37, 2024

Song Wang, Yaochen Zhu, Haochen Liu, Zaiyi Zheng, Chen Chen, and Jundong Li. Knowledge editing for large language models: A survey.ACM Computing Surveys, 57(3):1–37, 2024

2024
[19]

Visualizing and understanding neural models in nlp

Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. Visualizing and understanding neural models in nlp. InProceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 681–691, 2016

2016
[20]

Language-specific neurons: The key to multilingual capabilities in large language models

Tianyi Tang, Wenyang Luo, Haoyang Huang, Dongdong Zhang, Xiaolei Wang, Wayne Xin Zhao, Furu Wei, and Ji-Rong Wen. Language-specific neurons: The key to multilingual capabilities in large language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5701–5715, 2024

2024
[21]

A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis

Alessandro Stolfo, Yonatan Belinkov, and Mrinmaya Sachan. A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7035–7052, 2023

2023
[22]

How large language models encode context knowledge? a layer-wise probing study

Tianjie Ju, Weiwei Sun, Wei Du, Xinwei Yuan, Zhaochun Ren, and Gongshen Liu. How large language models encode context knowledge? a layer-wise probing study. InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 8235–8246, 2024

2024
[23]

Eliciting Latent Predictions from Transformers with the Tuned Lens

Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens.arXiv preprint arXiv:2303.08112, 2023

work page internal anchor Pith review arXiv 2023
[24]

A mathematical framework for transformer circuits.Transformer Circuits Thread,

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A...
[25]

https://transformer-circuits.pub/2021/framework/index.html

2021
[26]

A practical review of mechanistic interpretability for transformer-based language models.arXiv preprint arXiv:2407.02646,

Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, and Ziyu Yao. A practical review of mech- anistic interpretability for transformer-based language models.arXiv preprint arXiv:2407.02646, 2024

work page arXiv 2024
[27]

Wagle: Strategic weight attribution for effective and modular unlearning in large language models.Advances in Neural Information Processing Systems, 37:55620–55646, 2024

Jinghan Jia, Jiancheng Liu, Yihua Zhang, Parikshit Ram, Nathalie Baracaldo, and Sijia Liu. Wagle: Strategic weight attribution for effective and modular unlearning in large language models.Advances in Neural Information Processing Systems, 37:55620–55646, 2024

2024
[28]

Saes can improve unlearning: Dynamic sparse autoencoder guardrails for precision unlearning in llms

Aashiq Muhamed, Jacopo Bonato, Mona T Diab, and Virginia Smith. Saes can improve unlearning: Dynamic sparse autoencoder guardrails for precision unlearning in llms. InSecond Conference on Language Modeling, 2025

2025
[29]

Continual learning and private unlearning

Bo Liu, Qiang Liu, and Peter Stone. Continual learning and private unlearning. InConference on Lifelong Learning Agents, pages 243–254. PMLR, 2022

2022
[30]

Tofu: A task of fictitious unlearning for llms

Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary Chase Lipton, and J Zico Kolter. Tofu: A task of fictitious unlearning for llms. InFirst Conference on Language Modeling, 2024. 12

2024
[31]

Negative preference optimization: From catastrophic collapse to effective unlearning

Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning. InFirst Conference on Language Modeling, 2024

2024
[32]

How to use and interpret activation patching.arXiv preprint arXiv:2404.15255,

Stefan Heimersheim and Neel Nanda. How to use and interpret activation patching.arXiv preprint arXiv:2404.15255, 2024

work page arXiv 2024
[33]

Investigating gender bias in language models using causal mediation analysis.Advances in neural information processing systems, 33:12388–12401, 2020

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. Investigating gender bias in language models using causal mediation analysis.Advances in neural information processing systems, 33:12388–12401, 2020

2020
[34]

Causal scrub- bing, a method for rigorously testing interpretability hypotheses.AI Alignment Fo- rum, 2022

Lawrence Chan, Adrià Garriga-Alonso, Nicholas Goldwosky-Dill, Ryan Greenblatt, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, and Nate Thomas. Causal scrub- bing, a method for rigorously testing interpretability hypotheses.AI Alignment Fo- rum, 2022. https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/ causal-scrubbing-a-method-for-rigorously-testing

2022
[35]

Localizing Model Behavior with Path Patching , journal =

Nicholas Goldowsky-Dill, Chris MacLeod, Lucas Sato, and Aryaman Arora. Localizing model behavior with path patching.arXiv preprint arXiv:2304.05969, 2023

work page arXiv 2023
[36]

Circuit stability characterizes language model generalization

Alan Sun. Circuit stability characterizes language model generalization. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9025–9040, 2025

2025
[37]

CLUE: Conflict-guided localization for LLM unlearning framework

Hang Chen, Jiaying Zhu, Xinyu Yang, and Wenya Wang. CLUE: Conflict-guided localization for LLM unlearning framework. InThe Fourteenth International Conference on Learning Representations, 2026

2026
[38]

Guiding high-performance sat solvers with unsat-core predictions

Daniel Selsam and Nikolaj Bjørner. Guiding high-performance sat solvers with unsat-core predictions. InInternational conference on theory and applications of satisfiability testing, pages 336–353. Springer, 2019

2019
[39]

A simple and flexible way of computing small unsatisfiable cores in sat modulo theories

Alessandro Cimatti, Alberto Griggio, and Roberto Sebastiani. A simple and flexible way of computing small unsatisfiable cores in sat modulo theories. InInternational Conference on Theory and Applications of Satisfiability Testing, pages 334–339. Springer, 2007

2007
[40]

Alloy+ hotcore: A fast approximation to unsat core

Nicolás D’Ippolito, Marcelo F Frias, Juan P Galeotti, Esteban Lanzarotti, and Sergio Mera. Alloy+ hotcore: A fast approximation to unsat core. InInternational Conference on Abstract State Machines, Alloy, B and Z, pages 160–173. Springer, 2010

2010
[41]

Can a suit of armor conduct electricity? a new dataset for open book question answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. InEMNLP, 2018

2018
[42]

URL: https://itch

Chris Mathwin, Guillaume Corlouer, Esben Kran, Fazl Barez, and Neel Nanda. Identifying a preliminary circuit for predicting gendered pronouns in gpt-2 small.URL: https://itch. io/jam/mechint/rate/1889871, page 2, 2023

work page arXiv 2023
[43]

Springer Nature, 2022

Ido Dagan, Dan Roth, Fabio Zanzotto, and Mark Sammons.Recognizing textual entailment: Models and applications. Springer Nature, 2022

2022
[44]

Interpretability in the wild: a circuit for indirect object identification in gpt-2 small

Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small. InThe Eleventh International Conference on Learning Representations
[45]

A circuit for python docstrings in a 4-layer attention-only transformer

Stefan Heimersheim and Jett Janiak. A circuit for python docstrings in a 4-layer attention-only transformer. InAlignment Forum, 2023

2023
[46]

Manning, Andrew Ng, and Christopher Potts

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. InProceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA, October 2013. Association for Co...

2013
[47]

Winogrande: An adversarial winograd schema challenge at scale. 2019. 13

2019
[48]

Tracr: Compiled transformers as a laboratory for interpretability.Advances in Neural Information Processing Systems, 36:37876–37899, 2023

David Lindner, János Kramár, Sebastian Farquhar, Matthew Rahtz, Tom McGrath, and Vladimir Mikulik. Tracr: Compiled transformers as a laboratory for interpretability.Advances in Neural Information Processing Systems, 36:37876–37899, 2023

2023
[49]

Michael Hanna, Ollie Liu, and Alexandre Variengien. How does gpt-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. Advances in Neural Information Processing Systems, 36:76033–76060, 2023
[50]

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a large-scale dataset for fact extraction and VERification. In Marilyn Walker, Heng Ji, and Amanda Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long...
[51]

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. Zero-shot relation extraction via reading comprehension. In Roger Levy and Lucia Specia, editors, Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333–342, Vancouver, Canada, August 2017. Association for Computational Linguistics
[52]

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051, 2023
[53]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

2020
[54]

Lei Yu, Jingcheng Niu, Zining Zhu, and Gerald Penn. Functional faithfulness in the wild: Circuit discovery with differentiable computation graph pruning. arXiv preprint arXiv:2407.03779, 2024
[55]

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, 2021
[56]

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017
[57]

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, a... In-context learning and induction heads. Transformer Circuits Thread, 2022
[58]

Hongxu Liu, Jing Ma, Xiaojie Wang, Caixia Yuan, and Fangxiang Feng. An information-theoretic parameter-free bayesian framework for probing labeled dependency trees from attention score. In The Fourteenth International Conference on Learning Representations, 2026
[59]

Vaidehi Patil, Peter Hase, and Mohit Bansal. Can sensitive information be deleted from llms? objectives for defending against extraction attacks. In The Twelfth International Conference on Learning Representations, 2023

A Details of Logical Circuit Framework

First, we systematically introduce three fundamental circuit logic types: the AND gate, OR gate,...
MLP-dominated circuits are significantly less prone to migration. Therefore, Mechanistic Localization for these knowledge-centric tasks genuinely imparts meaningful guidance for future parameter updates.
Attention-dominated circuits are highly susceptible to migration, leading to profound structural discrepancies across different parameter states. Consequently, Mechanistic Localization for these skill-centric tasks suffers from severe temporal latency, rendering it ineffective for guiding dynamic updates.

F Extended Analysis of Future Mechanistic Localiz...
Inherent Limitations of Circuit Discovery: The process of circuit discovery itself is notoriously difficult to scale to exceptionally large LLMs and imposes stringent requirements on data quality. Consequently, this computational bottleneck precludes further analysis under massive data and model scaling scenarios, thereby restricting the direct application...
Coupling of Localization and Parameter Update Mechanisms: Many contemporary Mechanistic Localization methodologies introduce bespoke parameter update techniques paired with their localization strategies; the effects of these two components are rarely strictly independent. Although employing standard SFT as our observational baseline allows us to capture un...