pith. sign in

arxiv: 2605.29901 · v1 · pith:PCCYBRRMnew · submitted 2026-05-28 · 💻 cs.CR · cs.LG

Dissecting the Black Box: Circuit-Level Analysis of LLM Vulnerability Detection

Pith reviewed 2026-06-29 07:06 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords mechanistic interpretabilityLLM vulnerability detectioncircuit tracingattention headssafety detectorsablation experimentscode securityGemma model
0
0 comments X

The pith

LLMs classify code as vulnerable when safety detectors in early layers fail to recognize safe patterns rather than by directly spotting flaws.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies circuit tracing to a small LLM to map the internal pathways used for vulnerability detection on C and C++ samples. It shows the model depends on attention heads that activate on safe coding patterns, with failure to activate leading to a vulnerable label. Key components sit in layers 5, 7 and 11, and ablating them produces large drops in accuracy. The detection process uses only a small fraction of the model's total capacity through these sparse circuits.

Core claim

Using Circuit Tracer on Gemma-2-2b, the analysis reveals that the model primarily relies on safety detectors, attention heads that recognize safe coding patterns, rather than directly detecting vulnerability signatures. When these safety detectors fail to activate, the model classifies code as vulnerable. Specific attention heads in early layers and MLP neurons in Layer 7 encode the relevant features, and ablation experiments confirm their causal role.

What carries the argument

Safety detectors: attention heads in layers 5 and 7 that recognize safe coding patterns, with failure to activate triggering a vulnerable classification.

If this is right

  • The model uses sparse circuits occupying only 16 percent of total capacity for this task.
  • Removing Layer 11 drops vulnerability detection accuracy from 100 percent to 6 percent.
  • Ablating 20 neurons in Layer 7 reduces accuracy by 50 percent.
  • Circuit-level explanations become available for individual security predictions.
  • Targeted improvements to detection systems are possible by modifying the identified components.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same safety-detector logic could appear in other LLM safety tasks such as detecting harmful prompts.
  • If the circuits prove stable across models, developers could edit them directly to change detection behavior without full retraining.
  • Testing the circuits on code in languages other than C and C++ would show whether the mechanism is language-specific.
  • The sparse nature suggests that vulnerability detection might be added or removed from a model with limited changes to a small set of neurons.

Load-bearing premise

The circuit-tracing method and the 472-sample dataset correctly isolate the causal mechanisms without confounding correlations or dataset-specific artifacts.

What would settle it

Repeating the circuit analysis and ablations on a larger, independently collected set of code samples from different sources and checking whether the same layers and neurons remain the controlling components.

Figures

Figures reproduced from arXiv: 2605.29901 by Christian Gehrmann, Chun Zhou, Syafiq Al Atiiq.

Figure 1
Figure 1. Figure 1: L0 norm trends across transformer layers for vul [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Activation heatmap of the top-20 vulnerability [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 2
Figure 2. Figure 2: Visualization of the attention pattern for the top [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Visualization of the sparse computational circuit for CWE-787 (Buffer Overflow). The graph traces the information [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: L0 activation patterns across vulnerability cate [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 5
Figure 5. Figure 5: Layer-wise flip rates using bidirectional causal [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
read the original abstract

Large language models (LLMs) can detect software vulnerabilities, but how do they actually identify vulnerable code? We address this question using mechanistic interpretability; analyzing the internal computations of a neural network to understand its reasoning process.Using Circuit Tracer on Gemma-2-2b, we trace the computational pathways activated when the model classifies 472 C/C++ code samples as vulnerable or safe. Our analysis reveals a surprising finding: the model primarily relies on safety detectors, attention heads that recognize safe coding patterns, rather than directly detecting vulnerability signatures. When these safety detectors fail to activate, the model classifies code as vulnerable. We identify the critical neural components: specific attention heads in early layers (L5, L7) that focus on safety patterns, and Multilayer Perceptron (MLP) neurons in Layer 7 that encode vulnerability-related features. Ablation experiments confirm their causal role; removing Layer 11 drops vulnerability detection accuracy from 100% to 6%, while ablating just 20 neurons in Layer 7 reduces it by 50%.Our findings show that LLM vulnerability detection uses sparse, interpretable circuits (only 16% of model capacity), enabling circuit-level explanations for security predictions and targeted improvements to detection systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper applies Circuit Tracer to Gemma-2-2b to analyze internal computations on 472 C/C++ code samples labeled vulnerable or safe. It concludes that vulnerability detection is implemented via sparse circuits consisting of safety-detector attention heads in layers 5 and 7 plus MLP neurons in layer 7 that recognize safe patterns; code is classified vulnerable when these components fail to activate. Ablation results are presented as causal confirmation, with Layer 11 removal dropping accuracy from 100% to 6% and ablation of 20 Layer-7 neurons reducing accuracy by 50%, and the full circuit using only 16% of model capacity.

Significance. If the traced circuits prove general rather than dataset-specific, the work supplies concrete, sparse mechanistic explanations for an LLM security task and demonstrates that ablation can isolate causally relevant components. The emphasis on safety-pattern detectors rather than direct vulnerability signatures is a potentially useful reframing for interpretability-guided improvements to detection systems.

major comments (3)
  1. [Abstract] Abstract: the 100% baseline accuracy on the 472-sample set is reported without any description of sample selection, balancing, or diversity criteria; this makes it impossible to assess whether the identified safety-detector heads and Layer-7 neurons encode general mechanisms or corpus-specific correlations present only in this narrow distribution.
  2. [Abstract] Abstract (ablation paragraph): the reported drops (Layer 11 to 6%, 20 neurons in Layer 7 to 50% reduction) are given without error bars, statistical controls for correlated effects, or verification that the ablations were performed on held-out data; these omissions prevent the results from establishing robust causality for the central claim that the model relies on absence of safe patterns.
  3. [Abstract] Abstract: the claim that the model 'primarily relies on safety detectors ... rather than directly detecting vulnerability signatures' is load-bearing, yet the perfect accuracy on 472 samples leaves open the possibility that the traced components simply encode safe-coding idioms that happen to be anti-correlated with the vulnerable labels in this particular set; no cross-dataset or out-of-distribution test is referenced to distinguish the two interpretations.
minor comments (1)
  1. [Abstract] The abstract states the circuit uses 'only 16% of model capacity' but does not define how this percentage is computed (e.g., fraction of heads/neurons, FLOPs, or parameter count).

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for these detailed comments on the abstract. We will revise the abstract to include dataset context and to qualify the scope of our claims. We respond to each major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the 100% baseline accuracy on the 472-sample set is reported without any description of sample selection, balancing, or diversity criteria; this makes it impossible to assess whether the identified safety-detector heads and Layer-7 neurons encode general mechanisms or corpus-specific correlations present only in this narrow distribution.

    Authors: The Methods section of the full manuscript describes the 472-sample dataset, which was constructed from standard C/C++ vulnerability benchmarks with explicit balancing across vulnerable and safe labels and coverage of multiple CWE categories. To address the concern for readers of the abstract, we will add a concise statement on sample selection and diversity criteria. revision: yes

  2. Referee: [Abstract] Abstract (ablation paragraph): the reported drops (Layer 11 to 6%, 20 neurons in Layer 7 to 50% reduction) are given without error bars, statistical controls for correlated effects, or verification that the ablations were performed on held-out data; these omissions prevent the results from establishing robust causality for the central claim that the model relies on absence of safe patterns.

    Authors: We agree that error bars and clearer procedural details would improve robustness. The reported ablations were performed on the circuit-discovery set, which is standard for demonstrating in-distribution causal effects in mechanistic interpretability. We will add error bars computed over multiple random seeds for the neuron ablations and clarify the in-distribution scope. Held-out ablation verification is not present in the current work. revision: partial

  3. Referee: [Abstract] Abstract: the claim that the model 'primarily relies on safety detectors ... rather than directly detecting vulnerability signatures' is load-bearing, yet the perfect accuracy on 472 samples leaves open the possibility that the traced components simply encode safe-coding idioms that happen to be anti-correlated with the vulnerable labels in this particular set; no cross-dataset or out-of-distribution test is referenced to distinguish the two interpretations.

    Authors: The circuit-tracing and ablation results within this dataset support the safety-detector interpretation, as the identified components activate preferentially on safe patterns. We will revise the abstract to explicitly limit the claim to the analyzed 472-sample distribution and to note that broader generality would require additional validation. revision: yes

standing simulated objections not resolved
  • We do not have cross-dataset or out-of-distribution evaluations in the current manuscript to distinguish general mechanisms from dataset-specific correlations.

Circularity Check

0 steps flagged

No circularity: empirical circuit tracing and ablations are direct measurements

full rationale

The paper's claims rest entirely on experimental results from Circuit Tracer applied to Gemma-2-2b and ablation tests on a 472-sample C/C++ dataset. Key findings (safety-detector heads in L5/L7, MLP neurons in L7, accuracy drops from 100% to 6% when ablating Layer 11) are reported as observed outcomes of these interventions, not as derivations, predictions, or first-principles results that reduce to the inputs by construction. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. The work is self-contained against its own empirical benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work is an empirical application of existing circuit-tracing methods to a new domain.

pith-pipeline@v0.9.1-grok · 5752 in / 1076 out tokens · 33227 ms · 2026-06-29T07:06:40.581728+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

38 extracted references · 12 canonical work pages · 7 internal anchors

  1. [1]

    Leonard Bereska and Efstratios Gavves. 2024. Mechanistic Interpretability for AI Safety – A Review.Transactions on Machine Learning Research(2024)

  2. [2]

    Saikat Chakraborty, Rahul Krishna, Yangruibo Ding, and Baishakhi Ray. 2022. Deep learning based vulnerability detection: Are we there yet?. InIEEE Transac- tions on Software Engineering, Vol. 48. 3280–3296

  3. [3]

    Yizheng Chen, Zhoujie Ding, Lamya Alowain, Xinyun Chen, and David Wagner

  4. [4]

    InProceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses (RAID ’23)

    DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection. InProceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses (RAID ’23). Association for Com- puting Machinery, New York, NY, USA, 654–668. doi:10.1145/3607199.3607242

  5. [5]

    Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey

  6. [6]

    Sparse Autoencoders Find Highly Interpretable Features in Language Models.arXiv preprint arXiv:2309.08600(2023)

  7. [7]

    Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, and Yizheng Chen. 2025. Vulnerability Detection with Code Language Models: How Far Are We?. In Proceedings of the 47th IEEE/ACM International Conference on Software Engineering (ICSE). arXiv preprint arXiv:2403.18624

  8. [8]

    Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. 2021. A Mathematical Framework for Transformer Circuits.Transformer Circuits Thread (2021)

  9. [9]

    Jiahao Fan, Yi Li, Shaohua Wang, and Tien N. Nguyen. 2020. A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries. InProceedings of the 17th International Conference on Mining Software Repositories(Seoul, Republic of Korea)(MSR ’20). Association for Computing Machinery, New York, NY, USA, 508–512. doi:10.1145/3379597.3387501

  10. [10]

    Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al . 2020. CodeBERT: A Pre- Trained Model for Programming and Natural Languages. InFindings of the Asso- ciation for Computational Linguistics: EMNLP 2020

  11. [11]

    Amirata Ghorbani, James Wexler, James Y Zou, and Been Kim. 2019. Towards automatic concept-based explanations.Advances in Neural Information Processing Systems32 (2019)

  12. [12]

    Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. 2018. Visualizing the loss landscape of neural nets. InAdvances in Neural Information Processing Systems, Vol. 31

  13. [13]

    Zhen Li, Deqing Zou, Shouhuai Xu, Hai Jin, Yawei Zhu, and Zhaoxuan Chen

  14. [14]

    In IEEE Transactions on Dependable and Secure Computing

    Automated Software Vulnerability Detection with Machine Learning. In IEEE Transactions on Dependable and Secure Computing

  15. [15]

    Zhen Li, Deqing Zou, Shouhuai Xu, Xinyu Ou, Hai Jin, Sujuan Wang, Zhijun Deng, and Yuyi Zhong. 2018. VulDeePecker: A Deep Learning-Based System for Vulnerability Detection. InProceedings of the 25th Annual Network and Distributed System Security Symposium (NDSS)

  16. [16]

    Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. 2024. Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2. InProceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. 278–300

  17. [17]

    Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy- Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zi- jian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgeni...

  18. [18]

    Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller

    Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. 2025. Sparse Feature Circuits: Discovering and Editing Inter- pretable Causal Graphs in Language Models. InInternational Conference on Learn- ing Representations

  19. [19]

    Samuel Marks and Max Tegmark. 2023. The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets.arXiv preprint arXiv:2310.06824(2023)

  20. [20]

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and Editing Factual Associations in GPT. InAdvances in Neural Information Processing Systems, Vol. 35. 17359–17372

  21. [21]

    Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt

  22. [22]

    Progress measures for grokking via mechanistic interpretability.arXiv preprint arXiv:2301.05217(2023). 10

  23. [23]

    Changan Niu, Chuanyi Li, Vincent Ng, Bin Luo, and Jidong Chen. 2022. SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations. arXiv preprint arXiv:2201.01549(2022)

  24. [24]

    Chris Olah, Arvind Satyanarayan, Ian Johnson, Shan Carter, Ludwig Schubert, Katherine Ye, and Alexander Mordvintsev. 2018. The building blocks of inter- pretability.Distill3, 3 (2018), e10

  25. [25]

    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. 2022. In-context learning and induction heads.arXiv preprint arXiv:2209.11895(2022)

  26. [26]

    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. 2022. In- context Learning and Induction Heads.Transformer Circuits Thread(2022). https: //transformer-circuits.pub/2022/in-context-learning-and-induction-heads/

  27. [27]

    Mateusz Pach, Shyamgopal Karthik, Quentin Bouniot, Serge Belongie, and Zeynep Akata. 2025. Sparse autoencoders learn monosemantic features in vision-language models.arXiv preprint arXiv:2504.02821(2025)

  28. [28]

    Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. 2021. Do vision transformers see like convolutional neural networks?. InAdvances in Neural Information Processing Systems, Vol. 34. 12116– 12128

  29. [29]

    Safety Research. 2024. Circuit Tracer: A Tool for Mechanistic Interpretability of Neural Networks. https://github.com/safety-research/circuit-tracer. Accessed: 2025-11-12

  30. [30]

    Benjamin Steenhoek, Md Mahbubur Rahman, Shaila Jiles, and Wei Le. 2023. An empirical study of deep learning models for vulnerability detection. InProceedings of the 45th International Conference on Software Engineering (ICSE). 2237–2249

  31. [31]

    Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. In- telliCode compose: Code generation using transformer. InProceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering

  32. [32]

    Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, et al

  33. [33]

    Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.Anthropic(2024)

  34. [34]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.Advances in neural information processing systems30 (2017)

  35. [36]

    Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2022. Interpretability in the Wild: a Circuit for Indirect Object Identi- fication in GPT-2 small.arXiv preprint arXiv:2211.00593(2022)

  36. [37]

    Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. InAdvances in Neural Information Processing Systems, Vol. 32

  37. [39]

    Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al

  38. [40]

    Representation Engineering: A Top-Down Approach to AI Transparency

    Representation Engineering: A Top-Down Approach to AI Transparency. arXiv preprint arXiv:2310.01405(2023). 11