pith. machine review for the scientific record.

arxiv: 2604.03764 · v1 · submitted 2026-04-04 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Automated Attention Pattern Discovery at Scale in Large Language Models

Arie van Deursen, Erik Mekkes, Jonathan Katzy, Maliheh Izadi, Razvan-Mihai Popescu

Pith reviewed 2026-05-13 18:22 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: attention patterns · masked autoencoder · mechanistic interpretability · large language models · code generation · StarCoder2 · vision transformers · model interventions

The pith

A masked autoencoder trained on attention patterns from code completions reconstructs them accurately, generalizes across models, and raises generation accuracy by 13.6 percent via selective interventions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper collects the attention patterns that large language model heads produce during code completion tasks on Java datasets and treats them as scalable interpretability signals. It trains AP-MAE, a vision transformer, to reconstruct masked versions of these patterns. Experiments on StarCoder2 show high reconstruction accuracy, minimal degradation on unseen models, identification of recurring patterns, and prediction of correct generations at 55 to 70 percent accuracy without access to ground truth. Selective interventions derived from the model improve generation accuracy by 13.6 percent, while excessive application leads to collapse. The approach positions attention patterns as a foundation for both large-scale analysis and targeted edits, and as a filter to guide finer mechanistic work.
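To make the raw signal concrete, here is a minimal collection sketch, assuming the Hugging Face transformers API and the public bigcode/starcoder2-3b checkpoint; the paper's exact completion scenarios, prompt mining, and storage pipeline are not reproduced here.

```python
# Hedged sketch: collect per-head attention maps for one code completion.
# Model ID and prompt are illustrative, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bigcode/starcoder2-3b")
# Eager attention so the implementation returns attention weights.
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-3b", attn_implementation="eager"
).eval()

prompt = "public int add(int a, int b) { return "
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions is a tuple with one (batch, heads, seq, seq) tensor per layer.
# Stacking yields a (layers * heads) collection of image-like attention maps.
maps = torch.cat([a.squeeze(0) for a in out.attentions], dim=0)
print(maps.shape)  # (num_layers * num_heads, seq_len, seq_len)
```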

Core claim

AP-MAE reconstructs masked attention patterns with high accuracy, generalizes across unseen models with minimal degradation, reveals recurring patterns across inferences, predicts whether a generation will be correct without access to ground truth at 55 to 70 percent accuracy depending on the task, and enables targeted interventions that increase accuracy by 13.6 percent when applied selectively but cause collapse when applied excessively.
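The correctness-prediction and head-selection parts of this claim can be pictured with a small sketch. The figure captions reference a CatBoost classifier and per-cluster SHAP values; the feature construction below is a random placeholder, not the paper's pipeline.

```python
# Hedged sketch: predict generation correctness from per-head pattern features,
# then rank heads by mean |SHAP| as intervention candidates. Features here are
# random placeholders; only the CatBoost + SHAP setup comes from the figures.
import numpy as np
from catboost import CatBoostClassifier, Pool

rng = np.random.default_rng(0)
X = rng.integers(0, 8, size=(1000, 96)).astype(float)  # cluster id per head
y = rng.integers(0, 2, size=1000)                      # 1 = correct generation

clf = CatBoostClassifier(iterations=200, verbose=False)
clf.fit(X[:800], y[:800])
acc = (clf.predict(X[800:]).ravel() == y[800:]).mean()
print("held-out accuracy:", acc)  # ~0.5 on this random placeholder data

shap = clf.get_feature_importance(Pool(X[800:], y[800:]), type="ShapValues")
head_scores = np.abs(shap[:, :-1]).mean(axis=0)  # last column = expected value
print("top intervention candidates:", np.argsort(head_scores)[-5:])
```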

What carries the argument

The Attention Pattern-Masked Autoencoder (AP-MAE), a vision transformer-based model that reconstructs masked attention patterns collected from LLM heads during structured code completions.
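A minimal sketch of the masked-autoencoder front end follows, assuming square maps, 8×8 patches, and the standard 75% MAE mask ratio (the defaults of the original MAE recipe; the paper's exact hyperparameters are not restated in this summary). The transformer encoder and decoder are omitted.

```python
# Hedged sketch: patchify an attention map and apply MAE-style random masking.
# Patch size and mask ratio are standard MAE defaults, assumed not confirmed.
import torch

def patchify(attn_map: torch.Tensor, patch: int = 8) -> torch.Tensor:
    """(H, W) attention map -> (num_patches, patch*patch) tokens."""
    H, W = attn_map.shape
    x = attn_map.reshape(H // patch, patch, W // patch, patch)
    return x.permute(0, 2, 1, 3).reshape(-1, patch * patch)

def random_mask(tokens: torch.Tensor, ratio: float = 0.75):
    """Keep a random (1 - ratio) fraction of patch tokens; the encoder sees only these."""
    n = tokens.shape[0]
    keep = max(1, int(n * (1 - ratio)))
    visible_idx = torch.randperm(n)[:keep]
    return tokens[visible_idx], visible_idx

attn = torch.softmax(torch.randn(64, 64), dim=-1)  # placeholder attention map
tokens = patchify(attn)                            # -> (64 patches, 64 dims)
visible, idx = random_mask(tokens)                 # encoder input: ~16 patches
print(visible.shape)                               # torch.Size([16, 64])
```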

If this is right

  • Attention patterns act as a scalable signal that supports global interpretability across large models without per-behavior case studies.
  • Recurring patterns can be mined automatically from completion scenarios in structured domains such as code.
  • AP-MAE predictions of generation correctness can be used without requiring ground-truth labels.
  • Targeted edits to attention patterns improve accuracy when applied selectively and provide a selection procedure for finer mechanistic analysis (a minimal intervention sketch follows this list).
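The intervention sketch referenced above, using Hugging Face GPT-2 for illustration because its forward() accepts a head_mask argument; the paper intervenes on StarCoder2 heads selected via SHAP, which this placeholder selection does not reproduce.

```python
# Hedged sketch: zero out selected attention heads via head_mask and compare
# next-token predictions. GPT-2 and the chosen heads are illustrative only.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

head_mask = torch.ones(model.config.n_layer, model.config.n_head)
head_mask[3, 5] = 0.0   # placeholder: zero layer 3, head 5
head_mask[7, 0] = 0.0   # placeholder: zero layer 7, head 0

inputs = tok("int x =", return_tensors="pt")
with torch.no_grad():
    base = model(**inputs).logits[0, -1].argmax()
    edit = model(**inputs, head_mask=head_mask).logits[0, -1].argmax()
print(tok.decode(base.item()), "->", tok.decode(edit.item()))
```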

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same attention-pattern reconstruction could be tested on non-code tasks to check whether recurring patterns generalize beyond structured data.
  • If the 13.6 percent gain holds under broader conditions, AP-MAE outputs might serve as a lightweight prior for steering generation in deployment settings.
  • The transferability across models suggests attention templates could be catalogued once and reused to bootstrap interpretability on new architectures.

Load-bearing premise

Attention patterns produced by LLM attention heads can be treated as scalable, image-like signals that vision models can analyze for global interpretability of model components.
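One way to make this premise concrete is a resizing step that turns a variable-length pattern into a fixed-size image. This is a sketch under assumed choices (bilinear interpolation, min-max normalization, a 64×64 target), since this summary does not specify the paper's preprocessing.

```python
# Hedged sketch: map a (seq, seq) attention pattern to a fixed-size, [0, 1]
# normalized single-channel image. All concrete choices here are assumptions.
import torch
import torch.nn.functional as F

def to_image(attn: torch.Tensor, size: int = 64) -> torch.Tensor:
    """(seq, seq) attention map -> (1, size, size) image-like tensor."""
    img = attn.unsqueeze(0).unsqueeze(0)  # (1, 1, seq, seq) for interpolate
    img = F.interpolate(img, size=(size, size), mode="bilinear",
                        align_corners=False)
    img = img - img.min()
    return (img / img.max().clamp_min(1e-8)).squeeze(0)

attn = torch.softmax(torch.randn(37, 37), dim=-1)  # variable-length pattern
print(to_image(attn).shape)                        # torch.Size([1, 64, 64])
```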

What would settle it

Running AP-MAE on a model family outside StarCoder2 and observing whether reconstruction accuracy falls below the reported levels or the selective intervention gain disappears.

Figures

Figures reproduced from arXiv: 2604.03764 by Arie van Deursen, Erik Mekkes, Jonathan Katzy, Maliheh Izadi, Razvan-Mihai Popescu.

Figure 1. Comparison of attention pattern reconstruction methods. [PITH_FULL_IMAGE:figures/full_fig_p005_1.png]
Figure 2. Comparison of different clustering results: (a) examples of different patterns found by clustering. [PITH_FULL_IMAGE:figures/full_fig_p008_2.png]
Figure 3. Distribution of the number of clusters in a head. [PITH_FULL_IMAGE:figures/full_fig_p009_3.png]
Figure 4. Performance of target LLMs on the studied tasks, and the accuracy of the CatBoost classifier. [PITH_FULL_IMAGE:figures/full_fig_p010_4.png]
Figure 5. Difference in mean SHAP values per cluster for the CatBoost classifiers, classifying predictions for [PITH_FULL_IMAGE:figures/full_fig_p011_5.png]
Figure 6. Maximum difference in mean SHAP values per cluster explaining the global effect of each pattern. [PITH_FULL_IMAGE:figures/full_fig_p011_6.png]
Figure 7. Effects of interventions based on SHAP values showing the net change in correct token generation. [PITH_FULL_IMAGE:figures/full_fig_p012_7.png]
Figure 8. SHAP values for the assignment operators task. [PITH_FULL_IMAGE:figures/full_fig_p019_8.png]
Figure 9. SHAP values for the boolean literals task. [PITH_FULL_IMAGE:figures/full_fig_p019_9.png]
Figure 10. SHAP values for the boolean operators task. [PITH_FULL_IMAGE:figures/full_fig_p020_10.png]
Figure 11. SHAP values for the closing brackets task. [PITH_FULL_IMAGE:figures/full_fig_p020_11.png]
Figure 12. SHAP values for the end-of-line token task. [PITH_FULL_IMAGE:figures/full_fig_p020_12.png]
Figure 13. SHAP values for the identifiers task (panels for the 3B, 7B, and 15B models; axes: heads × layers). [PITH_FULL_IMAGE:figures/full_fig_p021_13.png]
Figure 14. SHAP values for the mathematical operators task. [PITH_FULL_IMAGE:figures/full_fig_p021_14.png]
Figure 15. SHAP values for the numeric literals task. [PITH_FULL_IMAGE:figures/full_fig_p021_15.png]
Figure 16. SHAP values for the random masking task. [PITH_FULL_IMAGE:figures/full_fig_p022_16.png]
Figure 17. SHAP values for the string literals task. [PITH_FULL_IMAGE:figures/full_fig_p022_17.png]
Figure 18. Net change in number of correct generations after intervention for the assignment operators task. [PITH_FULL_IMAGE:figures/full_fig_p023_18.png]
Figure 19. Net change in number of correct generations after intervention for the boolean literals task. [PITH_FULL_IMAGE:figures/full_fig_p023_19.png]
Figure 20. Net change in number of correct generations after intervention for the boolean operators task. [PITH_FULL_IMAGE:figures/full_fig_p024_20.png]
Figure 21. Net change in number of correct generations after intervention for the closing brackets task. [PITH_FULL_IMAGE:figures/full_fig_p024_21.png]
Figure 22. Net change in number of correct generations after intervention for the end-of-line token task. [PITH_FULL_IMAGE:figures/full_fig_p025_22.png]
Figure 23. Net change in number of correct generations after intervention for the identifiers task. [PITH_FULL_IMAGE:figures/full_fig_p025_23.png]
Figure 24. Net change in number of correct generations after intervention for the mathematical operators task. [PITH_FULL_IMAGE:figures/full_fig_p026_24.png]
Figure 25. Net change in number of correct generations after intervention for the numeric literals task. [PITH_FULL_IMAGE:figures/full_fig_p026_25.png]
Figure 26. Net change in number of correct generations after intervention for the random masking task. [PITH_FULL_IMAGE:figures/full_fig_p027_26.png]
Figure 27. Net change in number of correct generations after intervention for the string literals task. [PITH_FULL_IMAGE:figures/full_fig_p027_27.png]
Original abstract

Large language models have found success by scaling up capabilities to work in general settings. The same can unfortunately not be said for interpretability methods. The current trend in mechanistic interpretability is to provide precise explanations of specific behaviors in controlled settings. These often do not generalize, or are too resource intensive for larger studies. In this work we propose to study repeated behaviors in large language models by mining completion scenarios in Java code datasets, through exploiting the structured nature of code. We collect the attention patterns generated in the attention heads to demonstrate that they are scalable signals for global interpretability of model components. We show that vision models offer a promising direction for analyzing attention patterns at scale. To demonstrate this, we introduce the Attention Pattern-Masked Autoencoder (AP-MAE), a vision transformer-based model that efficiently reconstructs masked attention patterns. Experiments on StarCoder2 show that AP-MAE (i) reconstructs masked attention patterns with high accuracy, (ii) generalizes across unseen models with minimal degradation, (iii) reveals recurring patterns across inferences, (iv) predicts whether a generation will be correct without access to ground truth, with accuracies ranging from 55% to 70% depending on the task, and (v) enables targeted interventions that increase accuracy by 13.6% when applied selectively, but cause collapse when applied excessively. These results establish attention patterns as a scalable signal for interpretability and demonstrate that AP-MAE provides a transferable foundation for both analysis and intervention in large language models. Beyond its standalone value, AP-MAE also serves as a selection procedure to guide fine-grained mechanistic approaches. We release code and models to support future work in large-scale interpretability.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes AP-MAE, a vision-transformer-based masked autoencoder, to mine and reconstruct attention patterns from StarCoder2 on Java code completions. It claims these patterns constitute scalable signals for global LLM interpretability, demonstrating high-fidelity masked reconstruction, cross-model generalization, discovery of recurring patterns, 55-70% accuracy in predicting generation correctness without ground truth, and selective interventions yielding a 13.6% accuracy gain (with collapse under excessive application).

Significance. If the empirical results are robust, the work supplies a practical, transferable foundation for large-scale interpretability by repurposing vision models on attention maps, while also serving as a filter for finer mechanistic studies. The release of code and models is a concrete asset for reproducibility.

major comments (2)
  1. [Abstract / Experiments] The reported reconstruction accuracy, 55-70% prediction accuracies, and 13.6% intervention gain are presented without any description of data splits, masking ratios, baseline comparisons, or statistical error bars, making it impossible to assess whether these figures support the central claim that attention patterns are general, scalable signals.
  2. [Experiments] All quantitative results (reconstruction, generalization, prediction, intervention) are obtained exclusively on StarCoder2 Java completions; the manuscript must demonstrate that the observed recurring patterns and AP-MAE performance persist on non-code domains or open-ended text, otherwise the scalability claim risks being an artifact of the syntactic regularity of code.

minor comments (2)
  1. Clarify the precise preprocessing step that converts raw attention tensors into image-like inputs suitable for the vision transformer; the current description leaves the spatial layout and channel handling ambiguous.
  2. Add a short ablation on the effect of the masking ratio on reconstruction fidelity and downstream prediction accuracy (a runnable sketch of such a sweep follows this report).
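A sketch of the sweep requested in minor comment 2, with a mean-filling baseline standing in for AP-MAE (whose reconstruction interface is hypothetical here) so the loop executes end to end.

```python
# Hedged sketch: masking-ratio ablation. `reconstruct(visible, idx, n_tokens)`
# is a hypothetical interface; a mean-fill baseline stands in for AP-MAE.
import torch

def random_mask(tokens: torch.Tensor, ratio: float):
    n = tokens.shape[0]
    keep = max(1, int(n * (1 - ratio)))
    idx = torch.randperm(n)[:keep]
    return tokens[idx], idx

def mean_fill(visible, idx, n_tokens):
    """Baseline reconstructor: predict the mean visible patch everywhere."""
    return visible.mean(dim=0, keepdim=True).expand(n_tokens, -1)

def ablate(reconstruct, maps, ratios=(0.5, 0.6, 0.75, 0.9)):
    results = {}
    for r in ratios:
        losses = [
            torch.mean((reconstruct(*random_mask(t, r), t.shape[0]) - t) ** 2).item()
            for t in maps
        ]
        results[r] = sum(losses) / len(losses)
    return results

maps = [torch.rand(64, 64) for _ in range(8)]  # placeholder patch-token maps
print(ablate(mean_fill, maps))                 # mean MSE per masking ratio
```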

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating the revisions we plan to make to strengthen the paper.

Point-by-point responses
  1. Referee: [Abstract / Experiments] The reported reconstruction accuracy, 55-70% prediction accuracies, and 13.6% intervention gain are presented without any description of data splits, masking ratios, baseline comparisons, or statistical error bars, making it impossible to assess whether these figures support the central claim that attention patterns are general, scalable signals.

    Authors: We agree that more detail is needed for proper assessment. In the revised version we will update the Experiments section (and the abstract where necessary) to describe the data splits (training on a large set of Java completions, testing on held-out examples), the masking ratio used in AP-MAE (following the standard 75% masking of the MAE framework), and baseline comparisons, including random masking and linear interpolation. We will also repeat experiments across seeds and report mean and standard deviation for the reconstruction accuracy, prediction accuracies, and intervention gains (a minimal sketch of this reporting follows these responses). revision: yes

  2. Referee: [Experiments] All quantitative results (reconstruction, generalization, prediction, intervention) are obtained exclusively on StarCoder2 Java completions; the manuscript must demonstrate that the observed recurring patterns and AP-MAE performance persist on non-code domains or open-ended text, otherwise the scalability claim risks being an artifact of the syntactic regularity of code.

    Authors: We acknowledge this valid concern regarding domain specificity. Our choice of Java code completions exploits syntactic regularity for scalable pattern discovery, as outlined in the introduction. To address scalability to general LLMs, we will add a limitations and future work section to the revised manuscript discussing how AP-MAE can be applied to non-code domains, and we will provide preliminary evidence by analyzing attention patterns from a different task where possible. However, full quantitative experiments on open-ended text are beyond the current experimental setup and would constitute a substantial extension. revision: partial
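The sketch promised in response 1: seed-repeated evaluation with mean and standard deviation. The `evaluate` body is a noisy stand-in; in the revision it would be one full AP-MAE train-and-evaluate run per seed.

```python
# Hedged sketch: report mean ± std over seeds. The evaluate() body is a
# placeholder for a full per-seed training and evaluation run.
import statistics
import torch

def evaluate() -> float:
    return 0.65 + 0.02 * torch.randn(()).item()  # placeholder accuracy

def seeded(metric, seeds=(0, 1, 2, 3, 4)):
    vals = []
    for s in seeds:
        torch.manual_seed(s)   # fix torch-level randomness for this run
        vals.append(metric())
    return statistics.mean(vals), statistics.stdev(vals)

mean, std = seeded(evaluate)
print(f"accuracy = {mean:.3f} ± {std:.3f}")
```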

Circularity Check

0 steps flagged

No circularity: all claims are direct empirical measurements on held-out StarCoder2 data

Full rationale

The paper presents AP-MAE as a vision-transformer model trained to reconstruct masked attention maps from Java code completions in StarCoder2. All five listed results—reconstruction accuracy, cross-model generalization, recurring pattern discovery, 55-70% correctness prediction, and selective 13.6% accuracy gains—are obtained by running the trained model on separate test splits and unseen models, with no equations, fitted parameters, or self-citations invoked to derive one quantity from another by construction. The work therefore contains no self-definitional, fitted-input-called-prediction, or self-citation-load-bearing steps; the derivation chain is simply data collection followed by standard supervised evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that attention patterns behave sufficiently like image data for vision-transformer techniques to transfer; no free parameters or invented entities are specified in the abstract.

axioms (1)
  • domain assumption: Attention patterns can be treated as image-like data suitable for analysis by vision transformers.
    This premise enables the application of the masked autoencoder AP-MAE.

pith-pipeline@v0.9.0 · 5620 in / 1300 out tokens · 66657 ms · 2026-05-13T18:22:13.312457+00:00 · methodology

discussion (0)

