pith. machine review for the scientific record. sign in

arxiv: 2601.14004 · v4 · submitted 2026-01-20 · 💻 cs.CL

Recognition: no theorem link

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

Authors on Pith no claims yet

Pith reviewed 2026-05-16 12:37 UTC · model grok-4.3

classification 💻 cs.CL
keywords mechanistic interpretabilitylarge language modelsmodel steeringlocalization methodsmodel alignmentcapability enhancementintervention protocolactionable interpretability
0
0 comments X

The pith

A survey frames mechanistic interpretability as a locate-steer-improve pipeline that turns diagnostic findings into targeted interventions for LLM alignment, capability, and efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper organizes existing mechanistic interpretability work into a practical pipeline called Locate, Steer, and Improve. It groups localization methods and steering methods according to the specific interpretable objects they act on, such as circuits or representations, to create a repeatable intervention protocol. The authors show how following this protocol produces concrete gains in model alignment, task capability, and computational efficiency. By treating MI as an engineering tool rather than pure observation, the survey aims to make interpretability directly useful for optimizing deployed language models.

Core claim

Mechanistic Interpretability can be turned into an actionable methodology for model optimization by structuring work around a Locate-Steer-Improve pipeline. Localization and steering techniques are formally categorized by the interpretable objects they target, yielding a rigorous protocol whose application demonstrably improves alignment, capability, and efficiency in large language models.

What carries the argument

The Locate-Steer-Improve pipeline, which sequences diagnosis of interpretable objects in LLMs with targeted interventions to produce measurable model improvements.

If this is right

  • Practitioners can diagnose specific model behaviors by matching failure modes to the categorized localization methods.
  • Steering interventions become more precise because they are indexed to the same interpretable objects used in diagnosis.
  • Model updates for alignment or efficiency can be performed with less full retraining by reusing the locate-steer steps.
  • The same object-based protocol can be applied across multiple models to compare their internal structures directly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The pipeline could be tested on multimodal models to see whether the same object categories transfer beyond text.
  • If the categorization proves stable, it might reduce the need for separate interpretability research tracks focused only on analysis.
  • Adoption would require benchmark suites that measure intervention success rather than only interpretability metrics.

Load-bearing premise

Existing mechanistic interpretability methods can be systematically grouped by their interpretable objects into one unified protocol that reliably delivers measurable gains in alignment, capability, and efficiency.

What would settle it

A controlled test in which a standard suite of localization and steering techniques, when applied according to the object-based categorization, produces no statistically significant improvement on alignment or capability benchmarks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2601.14004 by Chaofan Tao, Chenming Shang, Dongdong Zhang, Ercong Nie, Feijiang Han, Hayden Kwok-Hay So, Hengyuan Zhang, Hinrich Sch\"utze, Hui Shen, Jing Xiong, Mingyang Wang, Ngai Wong, Qianli Wang, Qibo Xue, Qi Zhang, Ruobing Xie, Senjie Jin, Shuzhou Yuan, Sophia Ananiadou, Tao Gui, Xiao Liang, Xuanjing Huang, Xufeng Duan, Yiwei Wang, Zeping Yu, Zhengwu Liu, Zhihao Zhang, Zhiheng Xi, Zunhai Su.

Figure 1
Figure 1. Figure 1: Overview of the paper structure. We begin by defining the core interpretable objects (§2) that form the foundation of our analysis. We then introduce a range of methods, ranging from localization (§3) to steering(§4). Finally, we illustrate how these methods can be applied to improve models (§5). 1. Introduction Large Language Models (LLMs) have recently achieved remarkable success, demonstrating out￾stand… view at source ↗
Figure 2
Figure 2. Figure 2 [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: , an SAE acts as a “microscope” for the LLM. It projects low-dimensional dense activations into a higher-dimensional sparse latent space, effectively “unpacking” the superposition. ... Activation Function ... ... ... Reconstruction Sparsity SparseAutoencoder M o d el Size Stage Neurons Small Size LLMs GPT-2 Neuron Toy Model Transformer Model GPT-2 Pythia-70m GPT-4 Claude-3 LLaMa-3.1 Gemma-2 Close Source Op… view at source ↗
Figure 4
Figure 4. Figure 4: Localization via Magnitude Analysis. (a) [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Overview of Causal Tracing. The method identifies critical internal states by creating a corrupted run (noising the subject “Space Needle”) and systematically restoring clean states to see which ones recover the prediction “Seattle”. The heatmap results reveal that factual information is processed in early MLP layers at the subject position and later transferred to the final token via attention. Based on t… view at source ↗
Figure 6
Figure 6. Figure 6: Neuron-level gradient-based localization for mitigating knowledge conflicts. [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Layer-wise probing pipeline for context knowledge. [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: (a) Projecting residual stream states reveals the layer-wise evolution of latent concepts, showing an English-centric bottleneck in multilingual settings (Wendler et al., 2024). (b) Projecting SAE decoder weights identifies the semantic meaning of sparse features (e.g., a “food” feature) by identifying top-ranked tokens Shu et al. (2025). Based on figures from Wendler et al. (2024) and Shu et al. (2025). e… view at source ↗
Figure 9
Figure 9. Figure 9: Knowledge circuit example. A sparse cross-layer circuit supporting the factual completion “The official language of France is French” in GPT-2-Medium. Left: A simplified circuit. Here, L15H0 means the first attention head in the 15th layer and MLP12 means the FFN block in the 13th layer. Right: Behavior of several special heads. The left matrix shows each head’s attention pattern, and the right heatmap sho… view at source ↗
Figure 10
Figure 10. Figure 10: Examples of Steering via Amplitude Manipulation. (a) Ablation for Language Steering: Tang et al. (2024) deactivate (zero out) “Chinese-specific neurons” to suppress model’s ability to generate Chinese, successfully forcing the model to switch its output to English. (b) Patching for Demographic Steering: Ahsan et al. (2025) inject a “Male Patch” into the model’s internal representation. This intervention n… view at source ↗
Figure 11
Figure 11. Figure 11: A representative Targeted Optimization pipeline. The method first identifies language-pair-relevant layers, then scores neuron language-awareness, and finally routes gradient updates to a small subset of language-aware neurons for selective fine-tuning. This illustrates how Targeted Optimization enforces locality via an object mask M in Eq. 20. Based on the figure from Zhu et al. (2024). Characteristics a… view at source ↗
Figure 12
Figure 12. Figure 12: The pipeline for steering LLMs using SAE features. (a) Steering Vector Extraction: The target steering vector is derived by analyzing a set of prompts to identify features that distinguish a concept-rich state z ′ from a neutral state z. The steering vector is computed as the weighted sum of these identified SAE features (i.e., decoder columns). (b) Steering LLM Behavior: This aggregated vector is injecte… view at source ↗
read the original abstract

Mechanistic Interpretability (MI) has emerged as a vital approach to demystify the opaque decision-making of Large Language Models (LLMs). However, existing reviews primarily treat MI as an observational science, summarizing analytical insights while lacking a systematic framework for actionable intervention. To bridge this gap, we present a practical survey structured around the pipeline: "Locate, Steer, and Improve." We formally categorize Localizing (diagnosis) and Steering (intervention) methods based on specific Interpretable Objects to establish a rigorous intervention protocol. Furthermore, we demonstrate how this framework enables tangible improvements in Alignment, Capability, and Efficiency, effectively operationalizing MI as an actionable methodology for model optimization. The curated paper list of this work is available at https://github.com/rattlesnakey/Awesome-Actionable-MI-Survey.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript presents a survey of mechanistic interpretability (MI) techniques for large language models, organized around a three-stage pipeline of 'Locate' (identifying interpretable features or circuits), 'Steer' (intervening on those features), and 'Improve' (applying the interventions to enhance model properties). It categorizes localization and steering methods according to specific 'Interpretable Objects' to create a unified intervention protocol. The authors demonstrate the framework's utility by reviewing how it facilitates improvements in alignment, capability, and efficiency, supported by a curated collection of papers available on GitHub.

Significance. If the proposed categorization holds, this work is significant in transforming MI from a primarily diagnostic tool into an actionable methodology for model development. By providing a structured pipeline and taxonomy based on interpretable objects, it offers practitioners a clear protocol for applying MI techniques. The inclusion of a publicly available curated list enhances the survey's value as a resource for the community, potentially accelerating research in alignment and optimization of LLMs. The explicit structure around Locate-Steer-Improve distinguishes it from prior observational surveys.

major comments (1)
  1. [§3] §3 (Categorization by Interpretable Objects): The claim that the taxonomy establishes a 'rigorous intervention protocol' is undercut by the absence of explicit decision criteria or decision tree for assigning methods to object categories (e.g., how a hybrid activation-weight method is classified); this makes the protocol less reproducible than asserted.
minor comments (3)
  1. [Abstract] Abstract: The phrase 'formally categorize' overstates the contribution; the categorization is a useful descriptive taxonomy rather than a formal mathematical framework.
  2. [§4.3] §4.3 (Efficiency improvements): The discussion of efficiency gains cites prior work but does not quantify trade-offs (e.g., steering overhead vs. inference savings); adding a summary table would strengthen the claim.
  3. [Introduction] GitHub link: The curated list is a strength, but the manuscript should include a brief description of inclusion/exclusion criteria used to build the list.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback and recognition of the survey's potential to transform mechanistic interpretability into an actionable methodology. We address the single major comment below and will revise the manuscript accordingly to strengthen the presentation of the taxonomy.

read point-by-point responses
  1. Referee: [§3] §3 (Categorization by Interpretable Objects): The claim that the taxonomy establishes a 'rigorous intervention protocol' is undercut by the absence of explicit decision criteria or decision tree for assigning methods to object categories (e.g., how a hybrid activation-weight method is classified); this makes the protocol less reproducible than asserted.

    Authors: We agree that explicit decision criteria would enhance the reproducibility of the classification. The taxonomy in §3 groups methods according to the primary interpretable object they target (activations, weights, circuits, or representations), which forms the basis of the Locate-Steer-Improve protocol. For hybrid methods, the manuscript implicitly classifies them by the dominant intervention target, but we acknowledge this is not stated with sufficient precision. In the revised version, we will add a dedicated subsection (or appendix) containing a decision flowchart and explicit criteria: methods are assigned to the category matching their primary object, with hybrids noted as such and classified by the component that drives the intervention effect. This will make the protocol more rigorous without changing the core taxonomy. revision: yes

Circularity Check

0 steps flagged

No significant circularity: survey organizes external literature without derivations

full rationale

The manuscript is a survey paper that proposes an organizational taxonomy and pipeline ('Locate, Steer, and Improve') for existing mechanistic interpretability methods. It aggregates and categorizes prior work by 'Interpretable Objects' but contains no equations, fitted parameters, predictions, or novel derivations that could reduce to inputs by construction. All substantive claims rest on citations to external literature; the GitHub-curated list and three-stage structure are presented as curatorial contributions rather than self-derived results. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing premises. This is the standard non-circular outcome for a well-structured survey.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The survey rests on standard domain assumptions in mechanistic interpretability without introducing new free parameters or invented entities.

axioms (1)
  • domain assumption Mechanistic interpretability yields insights that can be turned into actionable interventions for LLMs
    This premise underpins the entire Locate-Steer-Improve structure and the claim that the framework enables tangible improvements.

pith-pipeline@v0.9.0 · 5557 in / 1170 out tokens · 56635 ms · 2026-05-16T12:37:27.469957+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training

    cs.CL 2026-05 unverdicted novelty 7.0

    Transformer circuits show free evolution during SFT, rendering static mechanistic localization inadequate for future parameter updates due to inherent temporal latency.

  2. Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

    cs.LG 2026-04 unverdicted novelty 7.0

    The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

  3. Freeze Deep, Train Shallow: Interpretable Layer Allocation for Continued Pre-Training

    cs.CL 2026-05 unverdicted novelty 6.0

    Freezing deep layers and training shallow layers during continued pre-training of LLMs outperforms full fine-tuning and the opposite allocation on C-Eval and CMMLU, guided by a new layer-sensitivity diagnostic.

  4. From Attribution to Action: A Human-Centered Application of Activation Steering

    cs.AI 2026-04 unverdicted novelty 6.0

    Activation steering paired with attribution enables intervention-based debugging in vision models, as all 8 interviewed experts shifted to hypothesis testing, most trusted observed responses, and highlighted risks lik...

  5. Head-wise Modality Specialization within MLLMs for Robust Fake News Detection under Missing Modality

    cs.CV 2026-04 unverdicted novelty 5.0

    Head-wise modality specialization via attention constraints and unimodal knowledge retention in MLLMs improves robustness to missing modalities in fake news detection while preserving full multimodal performance.

  6. SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining

    cs.LG 2026-02 conditional novelty 5.0

    SnapMLA achieves up to 1.91x higher throughput in long-output MLA decoding using FP8 quantization and specialized kernels while keeping benchmark quality near the BF16 baseline.

  7. Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models

    cs.CL 2026-05 unverdicted novelty 4.0

    Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.

Reference graph

Works this paper leans on

293 extracted references · 293 canonical work pages · cited by 7 Pith papers · 22 internal anchors

  1. [1]

    Hiba Ahsan, Arnab Sen Sharma, Silvio Amir, David Bau, and Byron C. Wallace. Elucidating Mecha- nisms of Demographic Bias in LLMs for Healthcare. http://arxiv.org/abs/2502.13319, September 2025

  2. [2]

    Symbols of One-Loop Integrals From Mixed Tate Motives

    Mst. Shapna Akter, Hossain Shahriar, Alfredo Cuzzocrea, and Fan Wu. Uncovering the interpretation of large language models. In Hossain Shahriar, Hiroyuki Ohsaki, Moushumi Sharmin, Dave Towey, A. K. M. Jahangir Alam Majumder, Yoshiaki Hori, Ji-Jiang Yang, Michiharu Takemoto, Nazmus Sakib, Ryohei Banno, and Sheikh Iqbal Ahamed, editors,48th IEEE Annual Comp...

  3. [3]

    Understanding intermediate layers using linear classifier probes

    Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. In5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Workshop Track Proceedings. OpenReview.net, 2017. URLhttps://openreview.net/ forum?id=HJ4-rAVtl

  4. [4]

    Physics of language models: Part 1, learning hierarchical language structures.arXiv preprint arXiv:2305.13673, 2023

    Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 1, learning hierarchical language structures.arXiv preprint arXiv:2305.13673, 2023

  5. [5]

    Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Wes Gurnee, Nicholas L. Turner, Brian Chen, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivo...

  6. [6]

    Systematic outliers in large language models.arXiv preprint arXiv:2502.06415, 2025

    Yongqi An, Xu Zhao, Tao Yu, Ming Tang, and Jinqiao Wang. Systematic outliers in large language models.arXiv preprint arXiv:2502.06415, 2025

  7. [7]

    Sparse autoencoders can capture language-specific concepts across diverse languages, 2025

    Lyzander Marciano Andrylie, Inaya Rahmanisa, Mahardika Krisna Ihsani, Alfan Farizki Wicaksono, Haryo Akbarianto Wibowo, and Alham Fikri Aji. Sparse autoencoders can capture language-specific concepts across diverse languages, 2025. URLhttps://arxiv.org/abs/2507.11230

  8. [8]

    Saes are good for steering–if you select the right features.arXiv preprint arXiv:2505.20063, 2025

    Dana Arad, Aaron Mueller, and Yonatan Belinkov. Saes are good for steering–if you select the right features.arXiv preprint arXiv:2505.20063, 2025

  9. [9]

    Refusal in language models is mediated by a single direction

    Andy Arditi, Oscar Balcells Obeso, Aaquib Syed, Daniel Paleka, Nina Rimsky, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URLhttps://openreview.net/ forum?id=pH3XAQME6c

  10. [10]

    Quarot: Outlier-free 4-bit inference in rotated llms.Advances in Neural Information Processing Systems, 37:100213–100240, 2024

    Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. Quarot: Outlier-free 4-bit inference in rotated llms.Advances in Neural Information Processing Systems, 37:100213–100240, 2024. URL https://dl.acm.org/doi/10.5555/3737916.3741096

  11. [11]

    Libbrecht

    Behrooz Azarkhalili and Maxwell W. Libbrecht. Generalized attention flow: Feature attribution for transformer models via maximum flow. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, 54 and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, ...

  12. [12]

    Understanding jailbreak success: A study of latent space dynamics in large language models, 2024

    Sarah Ball, Frauke Kreuter, and Nina Panickssery. Understanding jailbreak success: A study of latent space dynamics in large language models, 2024. URLhttps://arxiv.org/abs/2406.09289

  13. [13]

    Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber- Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans

    Amin Banayeeanzade, Ala N Tak, Fatemeh Bahrani, Anahita Bolourani, Leonardo Blas, Emilio Ferrara, Jonathan Gratch, and Sai Praneeth Karimireddy. Psychological steering in llms: An evaluation of effectiveness and trustworthiness.arXiv preprint arXiv:2510.04484, 2025

  14. [14]

    Steering latent traits, not learned facts: An empirical study of activation control limits.arXiv preprint arXiv:2511.18284, 2025

    Tetiana Bas and Krystian Novak. Steering latent traits, not learned facts: An empirical study of activation control limits.arXiv preprint arXiv:2511.18284, 2025

  15. [15]

    Steering large language model activations in sparse spaces.arXiv preprint arXiv:2503.00177, 2025

    Reza Bayat, Ali Rahimi-Kalahroudi, Mohammad Pezeshki, Sarath Chandar, and Pascal Vincent. Steering large language model activations in sparse spaces.arXiv preprint arXiv:2503.00177, 2025

  16. [16]

    Gallegos, Ryan A

    Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances.Comput. Linguistics, 48 (1):207–219, 2022. doi: 10.1162/COLI\_A\_00422. URLhttps://doi.org/10.1162/coli_ a_00422

  17. [17]

    Eliciting Latent Predictions from Transformers with the Tuned Lens

    Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112, 2023

  18. [18]

    Mechanistic interpretability for ai safety–a review.arXiv preprint arXiv:2404.14082, 2024

    Leonard Bereska and Efstratios Gavves. Mechanistic interpretability for ai safety–a review.arXiv preprint arXiv:2404.14082, 2024

  19. [19]

    Unveiling visual perception in language models: An attention head analysis approach

    Jing Bi, Junjia Guo, Yunlong Tang, Lianggong Bruce Wen, Zhang Liu, Bingjie Wang, and Chenliang Xu. Unveiling visual perception in language models: An attention head analysis approach. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4135–4144, 2025. URL https://openaccess.thecvf.com/content/CVPR2025/papers/Bi_Unveiling_ Visu...

  20. [20]

    Hopping too late: Exploring the limitations of large language models on multi-hop queries

    Eden Biran, Daniela Gottesman, Sohee Yang, Mor Geva, and Amir Globerson. Hopping too late: Exploring the limitations of large language models on multi-hop queries. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16,...

  21. [21]

    Open source sparse autoencoders for all residual stream layers of gpt2 small, 2024

    Joseph Bloom. Open source sparse autoencoders for all residual stream layers of gpt2 small, 2024. URL https://www.alignmentforum.org/posts/f9EgfLSurAiqRJySD/ open-source-sparse-autoencoders-for-all-residual-stream

  22. [22]

    Quantizable transform- ers: Removing outliers by helping attention heads do nothing

    Yelysei Bondarenko, Markus Nagel, and Tijmen Blankevoort. Quantizable transform- ers: Removing outliers by helping attention heads do nothing. In A. Oh, T. Nau- mann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neu- ral Information Processing Systems, volume 36, pages 75067–75096. Curran Associates, Inc., 2023. URL https://procee...

  23. [23]

    Beyond Multiple Choice: Evaluating Steering Vectors for Summarization

    Joschka Braun, Carsten Eickhoff, and Seyed Ali Bahrainian. Beyond multiple choice: Evaluating steering vectors for adaptive free-form summarization.arXiv preprint arXiv:2505.24859, 2025

  24. [24]

    Attention approximates sparse distributed memory.Advances in Neural Information Processing Systems, 34:15301–15315, 2021

    Trenton Bricken and Cengiz Pehlevan. Attention approximates sparse distributed memory.Advances in Neural Information Processing Systems, 34:15301–15315, 2021

  25. [25]

    Towards monosemanticity: Decomposing language models with dictionary learning.Transformer Circuits Thread, 2023

    Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Ch...

  26. [26]

    Large language models share representations of latent grammatical concepts across typologically diverse languages

    Jannik Brinkmann, Chris Wendler, Christian Bartelt, and Aaron Mueller. Large language models share representations of latent grammatical concepts across typologically diverse languages. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguist...

  27. [27]

    Batchtopk sparse autoencoders.arXiv preprint arXiv:2412.06410, 2024

    Bart Bussmann, Patrick Leask, and Neel Nanda. Batchtopk sparse autoencoders.arXiv preprint arXiv:2412.06410, 2024

  28. [28]

    Locating and Mitigating Gender Bias in Large Language Models, March 2024a

    Yuchen Cai, Ding Cao, Rongxi Guo, Yaqin Wen, Guiquan Liu, and Enhong Chen. Locating and Mitigating Gender Bias in Large Language Models, March 2024a

  29. [29]

    PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

    Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling.arXiv preprint arXiv:2406.02069, 2024b

  30. [30]

    Spectral filters, dark signals, and attention sinks

    Nicola Cancedda. Spectral filters, dark signals, and attention sinks. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4792–4808, 2024

  31. [31]

    Dissecting Bias in LLMs: A Mechanistic Inter- pretability Perspective, June 2025

    Bhavik Chandna, Zubair Bashir, and Procheta Sen. Dissecting Bias in LLMs: A Mechanistic Inter- pretability Perspective, June 2025

  32. [32]

    Treereview: A dynamic tree of questions framework for deep and efficient llm-based scientific peer review.arXiv preprint arXiv:2506.07642, 2025

    Yuan Chang, Ziyue Li, Hengyuan Zhang, Yuanbo Kong, Yanru Wu, Zhijiang Guo, and Ngai Wong. Treereview: A dynamic tree of questions framework for deep and efficient llm-based scientific peer review.arXiv preprint arXiv:2506.07642, 2025

  33. [33]

    A is for absorption: Studying feature splitting and absorption in sparse autoencoders.CoRR, abs/2409.14507,

    David Chanin, James Wilken-Smith, Tomás Dulka, Hardik Bhatnagar, and Joseph Bloom. A is for absorption: Studying feature splitting and absorption in sparse autoencoders.CoRR, abs/2409.14507,

  34. [34]
  35. [35]

    Transferring linear features across language models with model stitching

    Alan Chen, Jack Merullo, Alessandro Stolfo, and Ellie Pavlick. Transferring linear features across language models with model stitching. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025a. 56

  36. [36]

    Towards under- standing safety alignment: A mechanistic perspective from safety neurons, 2025b

    Jianhui Chen, Xiaozhi Wang, Zijun Yao, Yushi Bai, Lei Hou, and Juanzi Li. Towards under- standing safety alignment: A mechanistic perspective from safety neurons, 2025b. URLhttps: //openreview.net/forum?id=1NkrxqY4jK

  37. [37]

    Identifying query-relevant neurons in large language models for long-form texts

    Lihu Chen, Adam Dejl, and Francesca Toni. Identifying query-relevant neurons in large language models for long-form texts. In Toby Walsh, Julie Shah, and Zico Kolter, editors,AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 - March 4, 2025, Philadelphia, PA, USA, pages 23595–23604. AAAI Press, 2025c. doi: 1...

  38. [38]

    Learnable privacy neurons localization in language models

    Ruizhe Chen, Tianxiang Hu, Yang Feng, and Zuozhu Liu. Learnable privacy neurons localization in language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 256–264, Bangkok, Thailand, August 2024a. Association for Computat...

  39. [39]

    Persona Vectors: Monitoring and Controlling Character Traits in Language Models

    Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. Persona vectors: Monitoring and controlling character traits in language models.arXiv preprint arXiv:2507.21509, 2025d

  40. [40]

    In-context sharpness as alerts: An inner representation perspective for hallucination mitigation

    Shiqi Chen, Miao Xiong, Junteng Liu, Zhengxuan Wu, Teng Xiao, Siyang Gao, and Junxian He. In-context sharpness as alerts: An inner representation perspective for hallucination mitigation. In Forty-first International Conference on Machine Learning, 2024b. URLhttps://openreview.net/ forum?id=s3e8poX3kb

  41. [41]

    From yes-men to truth-tellers: Addressing sycophancy in large language models with pinpoint tuning

    Wei Chen, Zhen Huang, Liang Xie, Binbin Lin, Houqiang Li, Le Lu, Xinmei Tian, Deng Cai, Yong- gang Zhang, Wenxiao Wang, Xu Shen, and Jieping Ye. From yes-men to truth-tellers: Addressing sycophancy in large language models with pinpoint tuning. InForty-first International Conference on Machine Learning, 2024c. URLhttps://openreview.net/forum?id=d2vONO90Rw

  42. [43]

    Journey to the center of the knowl- edge neurons: Discoveries of language-independent knowledge neurons and degenerate knowledge neurons

    Yuheng Chen, Pengfei Cao, Yubo Chen, Kang Liu, and Jun Zhao. Journey to the center of the knowl- edge neurons: Discoveries of language-independent knowledge neurons and degenerate knowledge neurons. In Michael J. Wooldridge, Jennifer G. Dy, and Sriraam Natarajan, editors,Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Con...

  43. [44]

    Yuheng Chen, Pengfei Cao, Yubo Chen, Kang Liu, and Jun Zhao. Knowledge localization: Mission not accomplished? enter query localization! InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025f. URLhttps: //openreview.net/forum?id=tfyHbvFZ0K. 57

  44. [47]

    Binary autoencoder for mechanistic interpretability of large language models.arXiv preprint arXiv:2509.20997, 2025

    Hakaze Cho, Haolin Yang, Brian M Kurkoski, and Naoya Inoue. Binary autoencoder for mechanistic interpretability of large language models.arXiv preprint arXiv:2509.20997, 2025

  45. [48]

    Towardefficientsparseautoencoder-guidedsteeringforimproved in-context learning in large language models

    IkhyunChoandJuliaHockenmaier. Towardefficientsparseautoencoder-guidedsteeringforimproved in-context learning in large language models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 28949–28961, 2025

  46. [49]

    Glass, and Pengcheng He

    Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James R. Glass, and Pengcheng He. Dola: Decoding by contrasting layers improves factuality in large language models. InThe Twelfth Interna- tional Conference on Learning Representations, 2024. URLhttps://openreview.net/forum? id=Th6NyL07na

  47. [50]

    Representations as language: An information-theoretic framework for interpretability.arXiv preprint arXiv:2406.02449, 2024

    Henry Conklin and Kenny Smith. Representations as language: An information-theoretic framework for interpretability.arXiv preprint arXiv:2406.02449, 2024

  48. [51]

    Towards automated circuit discovery for mechanistic interpretability

    Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga- Alonso. Towards automated circuit discovery for mechanistic interpretability. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 16318–16352. Curran Associates, Inc., 20...

  49. [52]

    What you can cram into a single \$&!#* vector: Probing sentence embeddings for linguistic properties

    Alexis Conneau, Germán Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. What you can cram into a single \$&!#* vector: Probing sentence embeddings for linguistic properties. In Iryna Gurevych and Yusuke Miyao, editors,Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July...

  50. [53]

    Sparse Autoencoders Find Highly Interpretable Features in Language Models

    Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models.arXiv preprint arXiv:2309.08600, 2023

  51. [54]

    Can we interpret latent reasoning using current mechanistic interpretabil- ity tools?, 2025

    Bartosz Cywiński, Bart Bussmann, Arthur Conmy, Josh Engels, Neel Nanda, and Senthooran Rajamanoharan. Can we interpret latent reasoning using current mechanistic interpretabil- ity tools?, 2025. URL https://www.alignmentforum.org/posts/YGAimivLxycZcqRFR/ can-we-interpret-latent-reasoning-using-current-mechanistic

  52. [55]

    Steering off course: Reliability challenges in steering language models.arXiv preprint arXiv:2504.04635, 2025

    Patrick Queiroz Da Silva, Hari Sethuraman, Dheeraj Rajagopal, Hannaneh Hajishirzi, and Sachin Kumar. Steering off course: Reliability challenges in steering language models.arXiv preprint arXiv:2504.04635, 2025. 58

  53. [56]

    Knowledge neurons in pretrained transformers

    Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge neurons in pretrained transformers. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 8493–85...

  54. [57]

    The cognitive revolution in interpretability: From explaining behavior to interpreting representations and algorithms.arXiv preprint arXiv:2408.05859, 2024

    Adam Davies and Ashkan Khakzar. The cognitive revolution in interpretability: From explaining behavior to interpreting representations and algorithms.arXiv preprint arXiv:2408.05859, 2024

  55. [58]

    Neuron based personality trait induction in large language models

    Jia Deng, Tianyi Tang, Yanbin Yin, Wenhao yang, Xin Zhao, and Ji-Rong Wen. Neuron based personality trait induction in large language models. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=LYHEY783Np

  56. [59]

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale.Advances in neural information processing systems, 35: 30318–30332, 2022

  57. [60]

    Tracing Positional Bias in Financial Decision-Making: Mechanistic Insights from Qwen2.5

    Fabrizio Dimino, Krati Saxena, Bhaskarjit Sarmah, and Stefano Pasquali. Tracing Positional Bias in Financial Decision-Making: Mechanistic Insights from Qwen2.5. InProceedings of the 6th ACM International Conference on AI in Finance, pages 96–104, November 2025. doi: 10.1145/3768292. 3770394

  58. [61]

    From what to how: Attributing clip’s latent components reveals unexpected semantic reliance

    Maximilian Dreyer, Lorenz Hufe, Jim Berend, Thomas Wiegand, Sebastian Lapuschkin, and Wojciech Samek. From what to how: Attributing clip’s latent components reveals unexpected semantic reliance. arXiv preprint arXiv:2505.20229, 2025

  59. [62]

    Towards secure tuning: Mitigating security risks arising from benign instruction fine-tuning.CoRR, abs/2410.04524, 2024

    Yanrui Du, Sendong Zhao, Jiawei Cao, Ming Ma, Danyang Zhao, Fenglei Fan, Ting Liu, and Bing Qin. Towards secure tuning: Mitigating security risks arising from benign instruction fine-tuning.CoRR, abs/2410.04524, 2024. doi: 10.48550/ARXIV.2410.04524. URLhttps://doi.org/10.48550/ arXiv.2410.04524

  60. [63]

    Unveiling language competence neurons: A psycholinguistic approach to model interpretability

    Xufeng Duan, Xinyu Zhou, Bei Xiao, and Zhenguang Cai. Unveiling language competence neurons: A psycholinguistic approach to model interpretability. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert, editors,Proceedings of the 31st International Conference on Computational Linguistics, pages 10148–1...

  61. [64]

    The llama 3 herd of models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv e-prints, pages arXiv–2407, 2024

  62. [65]

    Layer-wise quantization: A pragmatic and effective method for quantizing llms beyond integer bit-levels.arXiv preprint arXiv:2406.17415, 2024

    Razvan-Gabriel Dumitru, Vikas Yadav, Rishabh Maheshwary, Paul-Ioan Clotan, Sathwik Tejaswi Madhusudhan, and Mihai Surdeanu. Layer-wise quantization: A pragmatic and effective method for quantizing llms beyond integer bit-levels.arXiv preprint arXiv:2406.17415, 2024

  63. [66]

    A mathemati- cal framework for transformer circuits.Transformer Circuits Thread, 2021

    Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario 59 Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah...

  64. [67]

    Toy Models of Superposition

    Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger B. Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. CoRR, abs/2209.10652, 2022. doi: 10.48550/ARXIV.2209.10652. URLhttps://...

  65. [68]

    LayerSkip: Enabling early exit inference and self-speculative decoding

    Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed Aly, Beidi Chen, and Carole- Jean Wu. LayerSkip: Enabling early exit inference and self-speculative decoding. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual ...

  66. [69]

    Sequential integrated gradients: a simple but effective method for explaining language models

    Joseph Enguehard. Sequential integrated gradients: a simple but effective method for explaining language models. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki, editors,Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pages 7555–

  67. [70]

    doi: 10.18653/V1/2023.FINDINGS-ACL.477

    Association for Computational Linguistics, 2023. doi: 10.18653/V1/2023.FINDINGS-ACL.477. URLhttps://doi.org/10.18653/v1/2023.findings-acl.477

  68. [71]

    How do language models bind entities in context? InNeurIPS 2023 Workshop on Symmetry and Geometry in Neural Representations, 2023

    Jiahai Feng and Jacob Steinhardt. How do language models bind entities in context? InNeurIPS 2023 Workshop on Symmetry and Geometry in Neural Representations, 2023

  69. [72]

    A primer on the inner workings of transformer-based language models.arXiv preprint arXiv:2405.00208, 2024

    Javier Ferrando, Gabriele Sarti, Arianna Bisazza, and Marta R Costa-Jussà. A primer on the inner workings of transformer-based language models.arXiv preprint arXiv:2405.00208, 2024

  70. [73]

    Truthful or fabricated? using causal attribution to mitigate reward hacking in explanations

    Pedro Ferreira, Wilker Aziz, and Ivan Titov. Truthful or fabricated? using causal attribution to mitigate reward hacking in explanations. InWorkshop on Actionable Interpretability at ICML 2025,

  71. [74]

    URLhttps://arxiv.org/abs/2504.05294

  72. [75]

    Towards empirical in- terpretation of internal circuits and properties in grokked transformers on modular polynomials

    Hiroki Furuta, Gouki Minegishi, Yusuke Iwasawa, and Yutaka Matsuo. Towards empirical in- terpretation of internal circuits and properties in grokked transformers on modular polynomials. Transactions on Machine Learning Research, 2024. URLhttps://openreview.net/forum?id= MzSf70uXJO

  73. [76]

    I have covered all the bases here: Interpreting reasoning features in large language models via sparse autoencoders.arXiv preprint arXiv:2503.18878, 2025

    Andrey Galichin, Alexey Dontsov, Polina Druzhinina, Anton Razzhigaev, Oleg Y Rogov, Elena Tu- tubalina, and Ivan Oseledets. I have covered all the bases here: Interpreting reasoning features in large language models via sparse autoencoders.arXiv preprint arXiv:2503.18878, 2025. URL https://arxiv.org/pdf/2503.18878

  74. [77]

    Exploring mechanistic interpretability in large language models: Challenges, approaches, and insights

    Sandeep Reddy Gantla. Exploring mechanistic interpretability in large language models: Challenges, approaches, and insights. In2025 International Conference on Data Science, Agents & Artificial Intelligence (ICDSAAI), pages 1–8. IEEE, 2025

  75. [78]

    H-neurons: On the existence, impact, and origin of hallucination-associated neurons in llms, 2025a

    Cheng Gao, Huimin Chen, Chaojun Xiao, Zhiyi Chen, Zhiyuan Liu, and Maosong Sun. H-neurons: On the existence, impact, and origin of hallucination-associated neurons in llms, 2025a. URL https://arxiv.org/abs/2512.01797. 60

  76. [79]

    Scaling and evaluating sparse autoencoders

    LeoGao, TomDuprélaTour, HenkTillman, GabrielGoh, RajanTroll, AlecRadford, IlyaSutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders.arXiv preprint arXiv:2406.04093, 2024

  77. [80]

    Weight-sparse transformers have interpretable circuits.arXiv preprint arXiv:2511.13653, 2025b

    Leo Gao, Achyuta Rajaram, Jacob Coxon, Soham V Govande, Bowen Baker, and Dan Mossing. Weight-sparse transformers have interpretable circuits.arXiv preprint arXiv:2511.13653, 2025b

  78. [81]

    Causal abstraction: A theoretical foundation for mechanistic interpretability.Journal of Machine Learning Research, 26(83): 1–64, 2025

    Atticus Geiger, Duligur Ibeling, Amir Zur, Maheep Chaudhary, Sonakshi Chauhan, Jing Huang, Aryaman Arora, Zhengxuan Wu, Noah Goodman, Christopher Potts, et al. Causal abstraction: A theoretical foundation for mechanistic interpretability.Journal of Machine Learning Research, 26(83): 1–64, 2025

  79. [82]

    Transformer feed-forward layers are key-value memories

    Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors,Proceedings of the 2021 Conference on Empirical Methods in Natural Lan- guage Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic, November

  80. [83]

    doi: 10.18653/v1/2021.emnlp-main.446

    Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.446. URL https://aclanthology.org/2021.emnlp-main.446/

Showing first 80 references.