pith. machine review for the scientific record.

arxiv: 2605.03052 · v1 · submitted 2026-05-04 · 💻 cs.CL

Recognition: 2 theorem links

· Lean Theorem

How Language Models Process Negation

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 18:25 UTC · model grok-4.3

classification 💻 cs.CL
keywords negation processing · large language models · mechanistic interpretability · attention mechanisms · LLM internals · constructive computation

The pith

Language models process negation by constructing representations of negative phrases more than by suppressing positives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how models like Mistral-7B and Llama-3.1-8B handle negation inside their layers. It shows that the models contain correct internal signals for negation yet often output wrong answers because late attention heads favor quick shortcuts; removing those shortcut heads raises accuracy on negation questions. Two processing routes appear: attention heads that focus on the negated span and dampen related ideas, and a stronger route that builds a direct vector for the full negative phrase, for example one that promotes alternatives to the negated term. The constructive route dominates in the studied models.

Core claim

Models implement both suppressive attention heads that attend to negated phrases and suppress associated concepts, and constructive mechanisms that directly encode the negated phrase as a vector promoting alternatives. The constructive mechanism is more prominent, and ablating late-layer attention modules that promote shortcuts markedly improves accuracy on negation-related tasks.
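The ablation intervention is easy to picture because a multi-head attention block's output is a sum of per-head terms, so "switching off" a head removes exactly its summand. A minimal numpy sketch with toy shapes and random weights (assumed for illustration, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads

x = rng.normal(size=(seq, d_model))            # residual-stream input
W_qkv = rng.normal(size=(3, n_heads, d_model, d_head))
W_o = rng.normal(size=(n_heads, d_head, d_model))

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, ablate_heads=()):
    """Attention-block output with selected heads zeroed out."""
    out = np.zeros_like(x)
    for h in range(n_heads):
        if h in ablate_heads:
            continue                            # ablation: drop this head's term
        q, k, v = (x @ W_qkv[i, h] for i in range(3))
        scores = softmax(q @ k.T / np.sqrt(d_head))
        out += (scores @ v) @ W_o[h]            # per-head contribution
    return out

full = attention(x)
no_head1 = attention(x, ablate_heads={1})
head1_only = attention(x, ablate_heads={0})
# Head contributions are additive, so ablating a head removes exactly its term.
assert np.allclose(full, no_head1 + head1_only)
```

In a real model the same effect is typically achieved with forward hooks that zero the chosen head's output before the residual addition.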

What carries the argument

The constructive mechanism that builds a representation of the entire negative phrase as a vector promoting alternatives, operating alongside suppressive attention heads.

If this is right

  • Ablating late-layer attention modules that promote shortcuts greatly improves accuracy on questions involving negation.
  • Models implement both a suppressive route and a stronger constructive route for negation.
  • The constructive route encodes the full negative phrase directly rather than only damping positives.
  • Both mechanisms coexist inside the same models.
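The "one linear direction" reading of the constructive route (cf. Figure 3) can be illustrated with toy numbers; the states, pairs, and unembedding below are all invented for illustration, not taken from the paper:

```python
import numpy as np

concepts = ["gas", "liquid"]
W_U = np.eye(2)                                # toy unembedding matrix

pairs = [                                      # (P+ state, P− state), toy numbers
    (np.array([1.0, 0.2]), np.array([0.1, 1.0])),
    (np.array([0.9, 0.3]), np.array([0.2, 1.1])),
]
# One shared direction from positive to negative states (cf. the Figure 3 arrows).
negation_dir = np.mean([n - p for p, n in pairs], axis=0)

held_out_pos = np.array([1.1, 0.1])            # an unseen positive-phrase state
steered = held_out_pos + negation_dir          # add the negation direction

# Adding the direction demotes the negated concept and promotes the alternative.
assert concepts[int(np.argmax(held_out_pos @ W_U))] == "gas"
assert concepts[int(np.argmax(steered @ W_U))] == "liquid"
```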

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training methods could be adjusted to favor the constructive route and reduce reliance on shortcut attention.
  • Similar dual-mechanism patterns may appear for other logical operators such as quantifiers or conditionals.
  • Interpretability tools that separate competing internal routes could generalize to debugging other logical failures.

Load-bearing premise

Observational and causal interpretability techniques, such as attention ablation and activation analysis, accurately isolate the negation mechanisms without interference from other components.
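Activation patching, one of the causal techniques this premise leans on, can be sketched in a few lines: cache an intermediate activation from a clean run, splice it into a corrupted run, and check how far the output moves back. The two-layer toy model below is hypothetical, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(6, 6))
W2 = rng.normal(size=(6, 2))

def forward(x, patch=None):
    h = np.tanh(x @ W1)            # intermediate activation ("residual stream")
    if patch is not None:
        h = patch                  # causal intervention: overwrite this site
    return h @ W2                  # output logits

clean = rng.normal(size=6)
corrupted = clean + rng.normal(scale=2.0, size=6)

h_clean = np.tanh(clean @ W1)      # cached activation from the clean run
y_clean = forward(clean)
y_corr = forward(corrupted)
y_patched = forward(corrupted, patch=h_clean)

# Patching fully restores the clean output here because, in this toy model,
# the patched site carries all information flowing downstream.
assert np.allclose(y_patched, y_clean)
```

The paper's setting is harder precisely because a real patched site does not carry all downstream information, which is what the premise above asserts the techniques can nonetheless disentangle.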

What would settle it

No gain in accuracy on negation questions after ablating the identified late-layer attention modules, or activation patterns that fail to match either the constructive vector or the suppressive attention behavior.

Figures

Figures reproduced from arXiv: 2605.03052 by Jonathan May, Robin Jia, Tianyi Zhou, Zhejian Zhou.

Figure 1
Figure 1: Illustration of competing mechanisms for negation. In the negation mechanism, attention module A1 moves the representation of the negation token (“not”) to the position of the concept being negated (e.g., amphibian). Subsequently, A2, together with downstream MLPs, constructs and promotes a new negated representation (e.g., mammal). In contrast, the shortcut mechanism bypasses explicit negation reasoning… view at source ↗
Figure 2
Figure 2: OLMo2 Positive and Negative Accuracies at various pre-training checkpoints. We observe that negative accuracy first plummets at early training steps, then rises again and stabilizes. view at source ↗
Figure 3
Figure 3: Visualization of PCA subspace. The residual stream hidden states are taken from Llama-3.1-8B at layer 11 after the attention module (11 mid). The hidden states of P+ and P− are colored as blue and red. Arrows indicate the direction from one hidden state of P+ to the corresponding hidden state of P−. It can be seen that positive and negative hidden states are approximately linearly separable by one direction. view at source ↗
Figure 5
Figure 5: Normalized evidence count plotted against attention layer index. The normalized evidence count measures the percentage of samples in the dataset for which evidence is identified at a given layer. Results are on Llama-3.1-8B. The blue (red) line indicates the ratio of samples for which LogitLens identifies Y¯ (Y) related tokens as top promoted (demoted) tokens. We observe that evidence count peaks at causally important… view at source ↗
Figure 6
Figure 6: Path Patching and Attention Sink Ablation results on Llama-3.1-8B. X axis indicates the center layer that we ablate or patch. Y axis is negation accuracy. Both methods suggest that mid-layer attention modules are causally important for negation processing. view at source ↗
Figure 7
Figure 7: Path Patching and Attention Sink Ablation results on Mistral-7B. X axis indicates the center layer that we ablate or patch. Y axis is negation accuracy. Both methods suggest that mid-layer attention modules are causally important for negation processing. view at source ↗
Figure 8
Figure 8: Cross-validated LDA model accuracy as a function of position in the residual stream. A higher accuracy indicates that we can decode “not” from the residual stream more reliably. As shown, “not” is moved to “Y” at early to middle layers. view at source ↗
Figure 9
Figure 9: Normalized evidence count plotted against attention layer index. Results are on mistralai/Mistral-7B-v0.1. The blue (red) line indicates the ratio of samples for which LogitLens identifies Y¯ (Y) related tokens as top promoted (demoted) tokens. We observe that evidence count for “not” follows the same trends as the patching results… view at source ↗
Figure 10
Figure 10: Visualization of the PCA space at different model layers. The hidden states of P+ and P− are colored as blue and red. Arrows indicate the direction from one hidden state of P+ to the corresponding hidden state of P−. It can be seen that positive and negative hidden states are approximately linearly separable by one direction. view at source ↗
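The LogitLens readings behind Figures 5 and 9 amount to projecting each layer's residual state through the unembedding matrix and noting which tokens are promoted. The toy vocabulary, unembedding, and states below are invented for illustration:

```python
import numpy as np

vocab = ["gas", "liquid", "solid"]
W_U = np.eye(3)                                # toy unembedding, one dim per token

# Hypothetical residual states for "water is not a gas": mid layers begin
# promoting the alternatives (liquid/solid) over the negated concept (gas).
residual_by_layer = [
    np.array([0.9, 0.1, 0.1]),   # layer 0: copies "gas" from context
    np.array([0.2, 0.8, 0.5]),   # layer 1: constructive route kicks in
    np.array([0.1, 1.0, 0.7]),   # layer 2: alternatives dominate
]

def top_token(h):
    """Top promoted token under the logit-lens projection."""
    logits = h @ W_U
    return vocab[int(np.argmax(logits))]

reading = [top_token(h) for h in residual_by_layer]
print(reading)                   # ['gas', 'liquid', 'liquid']
```

The paper's evidence counts aggregate this per-layer reading over a dataset: the fraction of samples at each layer whose top promoted tokens are negation-consistent.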
read the original abstract

We study how Large Language Models (LLMs) process negation mechanistically. First, we establish that even though open-weight models often provide wrong answers to questions involving negation, they do possess internal components that process negation correctly. Their poor accuracy is due to late-layer attention behavior that promotes simple shortcuts; ablating those attention modules greatly improves accuracy on negation-related questions. Second, we uncover how models process negation. We consider two hypotheses: models could use attention heads that attend to the phrase being negated and suppress related concepts, or they could directly construct a representation of the entire negative phrase (e.g., representing "not gas" as a vector that promotes liquids and solids). We apply a range of observational and causal interpretability techniques on Mistral-7B and Llama-3.1-8B to show that models implement both mechanisms, with the "constructive" mechanism being more prominent. Combined, our work deepens the understanding of LLMs' internals, highlighting construction-dominant computations and the coexistence of competing mechanisms within LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLMs such as Mistral-7B and Llama-3.1-8B possess internal components that correctly process negation, but late-layer attention promotes shortcut behaviors that degrade accuracy on negation-related questions; ablating those modules improves performance. It tests two hypothesized mechanisms (suppressive attention heads that attend to negated phrases versus direct construction of negative representations, e.g., 'not gas' promoting liquids/solids) and applies observational and causal interpretability techniques to conclude that both are implemented, with the constructive mechanism being more prominent.

Significance. If the causal interventions cleanly isolate the two mechanisms without residual cross-talk from distributed attention, the work provides concrete mechanistic evidence for coexistence of competing computations in LLMs and a dominance ordering favoring construction over suppression. This strengthens the case for construction-dominant circuits in logical reasoning and supplies falsifiable predictions about ablation effects that could guide future circuit-level analyses.

major comments (2)
  1. [Abstract / causal-intervention results] Abstract and the causal-intervention section: the claim that ablating late-layer attention modules 'greatly improves accuracy' on negation questions is load-bearing for the shortcut hypothesis, yet the abstract supplies no quantitative effect sizes, baseline comparisons, or controls showing that the same ablation leaves non-negation performance unchanged. Without these, it is impossible to rule out that the improvement is an artifact of general capacity reduction rather than targeted removal of negation shortcuts.
  2. [Interpretability experiments] The section applying observational and causal techniques to distinguish suppressive vs. constructive mechanisms: because transformer attention is distributed, ablating or patching individual heads can produce effects consistent with both hypotheses simultaneously. The manuscript does not report head-level orthogonality tests, circuit decomposition, or controls that would demonstrate the interventions isolate one mechanism from the other; this directly undermines the ability to assert that the constructive mechanism is 'more prominent.'
minor comments (2)
  1. [Hypotheses section] Clarify the precise definition of 'constructive' vs. 'suppressive' representations with an example vector or activation pattern so readers can replicate the classification criteria.
  2. [Experimental setup] The datasets and exact negation-question templates used for accuracy measurements are not listed; include them (or a pointer to the release) to allow reproduction of the reported accuracy gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments on our work. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract / causal-intervention results] Abstract and the causal-intervention section: the claim that ablating late-layer attention modules 'greatly improves accuracy' on negation questions is load-bearing for the shortcut hypothesis, yet the abstract supplies no quantitative effect sizes, baseline comparisons, or controls showing that the same ablation leaves non-negation performance unchanged. Without these, it is impossible to rule out that the improvement is an artifact of general capacity reduction rather than targeted removal of negation shortcuts.

    Authors: We agree that the abstract would be strengthened by including quantitative details. The full manuscript reports ablation results with specific accuracy improvements on negation tasks alongside controls confirming stable performance on non-negation benchmarks. We will revise the abstract to incorporate these effect sizes, baseline comparisons, and non-negation controls to directly address concerns about general capacity reduction. revision: yes

  2. Referee: [Interpretability experiments] The section applying observational and causal techniques to distinguish suppressive vs. constructive mechanisms: because transformer attention is distributed, ablating or patching individual heads can produce effects consistent with both hypotheses simultaneously. The manuscript does not report head-level orthogonality tests, circuit decomposition, or controls that would demonstrate the interventions isolate one mechanism from the other; this directly undermines the ability to assert that the constructive mechanism is 'more prominent.'

    Authors: We acknowledge the distributed nature of attention and the resulting challenge in cleanly isolating mechanisms. Our interventions combine head-specific ablation (targeting suppressive attention patterns to negated phrases) with representation-level patching (targeting constructive negative phrase vectors), yielding convergent evidence from observational and causal methods that favors the constructive mechanism. We will add a dedicated discussion of potential cross-talk, any available orthogonality metrics, and a more qualified statement on prominence to reflect the limitations of isolation in distributed systems. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical interpretability findings are self-contained

full rationale

The paper's central claims rely on applying observational and causal interpretability methods (attention ablation, activation analysis) to Mistral-7B and Llama-3.1-8B to identify negation-processing mechanisms. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-citation chains are present that would reduce any result to its own inputs by construction. The coexistence and prominence of constructive vs. suppressive mechanisms are reported as outcomes of direct interventions on model internals, not as logical equivalences or ansatzes smuggled via prior self-work. This is a standard empirical analysis with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard mechanistic interpretability assumptions rather than new free parameters or invented entities.

axioms (2)
  • domain assumption Activation patching and attention ablation can causally isolate the contribution of specific heads or layers to a behavioral outcome.
    Invoked when claiming that ablating late-layer attention improves negation accuracy and that this reveals the shortcut mechanism.
  • domain assumption Observational techniques (e.g., attention visualization, representation similarity) combined with causal interventions can distinguish between suppression and constructive mechanisms.
    Used to conclude that both mechanisms are implemented and that construction is more prominent.

pith-pipeline@v0.9.0 · 5473 in / 1401 out tokens · 20059 ms · 2026-05-08T18:25:01.470535+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

72 extracted references · 27 canonical work pages · 5 internal anchors
