Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
Abstract
Mechanistic Interpretability (MI) has emerged as a vital approach to demystify the opaque decision-making of Large Language Models (LLMs). However, existing reviews primarily treat MI as an observational science, summarizing analytical insights while lacking a systematic framework for actionable intervention. To bridge this gap, we present a practical survey structured around the pipeline: "Locate, Steer, and Improve." We formally categorize Localizing (diagnosis) and Steering (intervention) methods based on specific Interpretable Objects to establish a rigorous intervention protocol. Furthermore, we demonstrate how this framework enables tangible improvements in Alignment, Capability, and Efficiency, effectively operationalizing MI as an actionable methodology for model optimization. The curated paper list of this work is available at https://github.com/rattlesnakey/Awesome-Actionable-MI-Survey.
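To make the "Locate, Steer" half of the pipeline concrete, below is a minimal illustrative sketch (not the survey's own code): a steering vector built from contrasting activations is added to a located layer's residual stream at inference time via a PyTorch forward hook. The toy model, the chosen layer index, the contrast activation sets, and the steering strength are all hypothetical stand-ins.

```python
# Illustrative sketch of "Locate, then Steer" (assumptions flagged below):
# a difference-of-means steering vector is added to one located layer's
# residual stream at inference time via a forward hook.
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyBlock(nn.Module):
    """Stand-in for one transformer layer's residual-stream update."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + torch.tanh(self.proj(x))  # residual connection

class ToyModel(nn.Module):
    def __init__(self, d_model: int = 16, n_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(ToyBlock(d_model) for _ in range(n_layers))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x

model = ToyModel()

# Locate: in practice, diagnosis picks the layer and builds the vector as
# mean(acts | behavior present) - mean(acts | behavior absent). Here the
# layer index and both activation sets are hypothetical stand-ins.
layer_idx = 2
pos_acts = torch.randn(32, 16)  # activations on behavior-present inputs
neg_acts = torch.randn(32, 16)  # activations on behavior-absent inputs
steer_vec = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
steer_vec = steer_vec / steer_vec.norm()

# Steer: add the vector to the located layer's output at inference time.
alpha = 4.0  # steering strength; negative values suppress the behavior

def steering_hook(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output.
    return output + alpha * steer_vec

handle = model.layers[layer_idx].register_forward_hook(steering_hook)
x = torch.randn(1, 16)
steered = model(x)
handle.remove()
baseline = model(x)
print("norm of steering-induced shift:", (steered - baseline).norm().item())
```

On a real LLM the same hook pattern would target a chosen decoder block, and the "Improve" step would then measure the downstream effect on alignment, capability, or efficiency metrics.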
Forward citations (cited by 6 Pith papers):
- Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training
  Transformer circuits evolve freely during SFT, rendering static mechanistic localization inadequate for guiding future parameter updates because of the inherent lag between localization and update.
- Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
  The first survey on attention sink in Transformers; it structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
- Freeze Deep, Train Shallow: Interpretable Layer Allocation for Continued Pre-Training
  Freezing deep layers and training shallow layers during continued pre-training of LLMs outperforms both full fine-tuning and the opposite allocation on C-Eval and CMMLU, guided by a new layer-sensitivity diagnostic.
- From Attribution to Action: A Human-Centered Application of Activation Steering
  Activation steering paired with attribution enables intervention-based debugging in vision models: all 8 interviewed experts shifted to hypothesis testing, most trusted the observed responses, and highlighted risks like …
- Head-wise Modality Specialization within MLLMs for Robust Fake News Detection under Missing Modality
  Head-wise modality specialization, via attention constraints and unimodal knowledge retention in MLLMs, improves robustness to missing modalities in fake news detection while preserving full multimodal performance.
- Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models
  Qwen-Scope provides open-source sparse autoencoders for Qwen models that serve as practical interfaces for steering, evaluation, data workflows, and optimization of large language models.