pith. machine review for the scientific record.

arxiv: 2605.08453 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI · stat.ML

Recognition: no theorem link

Sink vs. diagonal patterns as mechanisms for attention switch and oversmoothing prevention

Christoph H. Lampert, Cristina López Amado, Marco Mondelli, Peter Súkeník

Pith reviewed 2026-05-12 02:10 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords attention sinks · oversmoothing · attention mechanisms · transformers · hard attention switch · diagonal patterns · self-attention

The pith

Sinks in attention are equivalent to hard switches that zero the output and cost less to represent than diagonal patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes geometric conditions under which attention sinks form in transformer models and proves these sinks are equivalent to a hard attention switch that makes the output identically zero. It refines the link between sinks and oversmoothing prevention by specifying when dense attention smooths more than sparse attention, a condition often met in practice. By relaxing the switch to allow token self-communication and comparing representation costs, the work shows sinks are favored over diagonal patterns in pretrained transformers. This matters because it explains observed attention behaviors and clarifies when attention layers function like MLPs without needing inter-token communication. The introduction of diagonal patterns closes the gap between what oversmoothing prevention requires and what sinks actually deliver.

Core claim

Sinks can be represented only when a geometric alignment holds between the sink embedding and all other embeddings; the paper derives this alignment as a necessary condition. Under this alignment, sinks are equivalent to hard attention switches that set the attention output to identically zero. When self-communication is allowed, the cost of representing sinks is lower than the cost of representing diagonal patterns, which accounts for the prevalence of sinks in pretrained transformers and shows why attention layers sometimes act like MLPs.
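The hard-switch equivalence is easy to visualize in a toy setting. The sketch below is a minimal NumPy illustration, not the paper's construction: the dominant sink score and the zero value vector are assumptions standing in for the alignment and value-suppression conditions the paper derives.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 16                       # context length, head dimension

# Hypothetical setup: position 0 is the sink. Every query scores it far
# above the other keys (mimicking the alignment condition), and its value
# vector is (near) zero.
scores = rng.normal(size=(T, T))   # arbitrary query-key scores
scores[:, 0] += 20.0               # sink score dominates every row

V = rng.normal(size=(T, d))        # value vectors
V[0] = 0.0                         # the sink carries a zero value

A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)  # row-wise softmax attention

out = A @ V                        # attention output per query
print("min mass on sink:", A[:, 0].min())                      # ~1.0
print("max output norm :", np.linalg.norm(out, axis=1).max())  # ~0
```

With all attention mass on a zero-valued sink, the attention path contributes nothing and the residual stream passes through unchanged, which is the sense in which the block then acts like an MLP.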

What carries the argument

The equivalence between sinks and a hard attention switch (output identically zero), together with the quantitative cost comparison against diagonal patterns.

Load-bearing premise

The sink embedding must be aligned with all other token embeddings for the geometric representation conditions and cost comparison to hold.
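Whether this alignment actually holds is measurable. Below is a minimal sketch of one such measurement, assuming the Hugging Face transformers GPT-2 checkpoint; the choice of model, the use of the raw input-embedding table rather than layer inputs, and the 1000-row sample are all illustrative assumptions, not the paper's protocol (the paper's Figure 1 reports a BOS-to-token similarity of this kind).

```python
import torch
from transformers import AutoModel

# Load GPT-2 small and read off its input embedding table.
model = AutoModel.from_pretrained("gpt2")
E = model.get_input_embeddings().weight.detach()   # (vocab_size, d_model)

# GPT-2 uses <|endoftext|> (vocab id 50256) as its BOS token; treat its
# embedding as the candidate sink direction.
sink = E[50256]

# Cosine similarity between the sink embedding and a sample of other rows.
others = E[:1000]                                  # illustrative sample
cos = torch.nn.functional.cosine_similarity(others, sink.unsqueeze(0), dim=-1)
print(f"mean={cos.mean():.3f}  min={cos.min():.3f}  max={cos.max():.3f}")
```

A tight, uniformly signed band of similarities would support the premise; a wide or sign-mixed spread would undercut the geometric precondition.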

What would settle it

Finding a trained transformer in which a sink token is present but the attention output is not identically zero, or in which the representation cost of diagonal patterns is lower than that of sinks.
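The first falsifier is directly testable. The sketch below assumes the Hugging Face transformers GPT-2 implementation; the module names (h, ln_1, c_attn), the 12-head/64-dimensional geometry, and the 0.9 sink-mass threshold are specific to that checkpoint and to this illustration, not taken from the paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
# Eager attention so per-head attention probabilities are returned.
model = AutoModel.from_pretrained("gpt2", attn_implementation="eager")
model.eval()

ids = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**ids, output_attentions=True, output_hidden_states=True)

n_heads, d_head, d_model = 12, 64, 768
for l, A in enumerate(out.attentions):             # A: (1, heads, T, T)
    sink_mass = A[0, :, 1:, 0].mean(dim=-1)        # per-head mass on position 0
    # Recompute this layer's per-head value vectors from its own weights.
    x = model.h[l].ln_1(out.hidden_states[l])      # pre-attention LayerNorm
    qkv = model.h[l].attn.c_attn(x)                # (1, T, 3 * d_model)
    v = qkv[..., 2 * d_model:].view(1, -1, n_heads, d_head).transpose(1, 2)
    head_out = A @ v                               # per-head context vectors
    norms = head_out[0].norm(dim=-1).mean(dim=-1)  # mean output norm per head
    for h in torch.where(sink_mass > 0.9)[0].tolist():
        print(f"layer {l:2d} head {h:2d}: "
              f"sink mass {sink_mass[h]:.2f}, output norm {norms[h]:.3f}")
```

Sink-dominated heads whose output norms are comparable to those of ordinary heads in the same layer would be evidence against the hard-switch reading; norms near zero would support it.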

Figures

Figures reproduced from arXiv:2605.08453 by Christoph H. Lampert, Cristina López Amado, Marco Mondelli, Peter Súkeník.

Figure 1: Cosine similarity between BOS and all other 256 token embeddings at the input to the …

Figure 2: Left: empirical average cosine similarity (solid) and theoretical approximation (2) (dashed) for X (first plot) and X + W Z (second plot), across models and context lengths. Right: empirical average cosine similarity in LLaMA3-8B for different components of the attention step, averaging over all heads (first plot) and over heads with uniformity coefficient (see (12) in Appendix E.1) larger than 0.6 (second …

Figure 3: Frobenius squared cost of a single transformer block if the attention pattern is either sink or …

Figure 4: Evolution of average cosine similarity across layers for different models, datasets, and …

Figure 5: Empirical average cosine similarity (solid) and theoretical approximation (2) (dashed) for …

Figure 6: Empirical average cosine similarity for different models, datasets and context lengths for …

Figure 7: Absolute difference between the empirical average cosine similarity and the theoretical …

Figure 8: avg. cos sim(X + W ZA⊤) − avg. cos sim(X + W Z) for each head of different models and datasets at context length 512. Heads satisfying β tr(BW) > 0 (red) tend to yield positive values, whereas heads satisfying β tr(BW) + tr(BW⊤W) < 0 (blue) tend to have negative values. Heads satisfying neither condition are shown in gray.

Figure 9: avg. cos sim(X + W ZA⊤ᵤ) − avg. cos sim(X + W ZA⊤) for each head of different models and datasets at context length 512. Heads satisfying β tr(BW) > 0 (red) tend to yield positive values, whereas heads satisfying β tr(BW) + tr(BW⊤W) < 0 (blue) tend to have negative values. Heads satisfying neither condition are shown in gray.

Figure 10: Relative change in average cosine similarity, …

Figure 11: Empirical average cosine similarity for selected heads from different models and datasets, …

Figure 12: Rank distributions across layers, pooled over heads, 50 sequences and 4 datasets. The …

Figure 13: Attention patterns in pretrained transformers (taken randomly) from a pool of models …
original abstract

This paper studies the role of sinks and diagonal patterns as attention switch and anti-oversmoothing mechanisms. We analyze geometric conditions under which sinks can be represented, showing a necessary alignment between the embedding of the sink and all other embeddings. Next, we refine the current understanding of the role of sinks in oversmoothing prevention: we specify the conditions under which dense attention provably smooths more than sparse attention, and empirically verify that such conditions are often satisfied in practice. We further prove an equivalence between sinks and hard attention switch, in which the output of the attention is identically 0. Finally, we relax the hard attention switch by allowing token self-communication: we provide a quantitative comparison of the costs of representing sinks vs. diagonal patterns, showing why sinks are favored in pretrained transformers. The introduction and analysis of diagonal patterns and the generalization of the attention switch close the gap between what oversmoothing prevention requires and what sinks provide, while also establishing when and why attention layers act like MLPs if token communication is not necessary.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper analyzes sinks and diagonal patterns in transformer attention as mechanisms for attention switching and oversmoothing prevention. It derives geometric conditions for sink representation (requiring alignment between the sink embedding and all other token embeddings), proves an equivalence between sinks and hard attention switches (where the attention output is identically zero), and specifies and empirically verifies conditions under which dense attention smooths more than sparse attention. After relaxing the hard switch to allow self-communication, it provides a quantitative cost comparison showing why sinks are favored over diagonal patterns in pretrained transformers. The work also discusses when attention layers behave like MLPs if inter-token communication is unnecessary.

Significance. If the geometric conditions and equivalences hold in practice, the paper offers a precise theoretical account of observed attention patterns, clarifies the gap between oversmoothing-prevention requirements and what sinks deliver, and explains the empirical preference for sinks via representational cost. The introduction of diagonal patterns as an alternative and the relaxation to self-communication are useful conceptual advances; the empirical checks on smoothing conditions add practical grounding.

major comments (3)
  1. [§3] Geometric conditions for sinks: The necessary alignment between the sink token embedding and every other embedding is required for both the equivalence proof and the subsequent cost comparison; the manuscript does not report any empirical measurement of this alignment in pretrained models, leaving the applicability of the main claims conditional on an unverified assumption.
  2. [§4] Equivalence to hard attention switch: The theorem establishing that sink attention yields an identically zero output holds only under the alignment condition of Assumption 2; without evidence that this geometry is realized in trained transformers, the claimed equivalence does not yet explain attention behavior in practice.
  3. [§5] Cost comparison after relaxation: The quantitative argument favoring sinks over diagonal patterns inherits the same alignment precondition; if the alignment is weak or absent, the cost analysis does not account for why pretrained models prefer sinks.
minor comments (2)
  1. [§3.2] The definition of 'dense' versus 'sparse' attention in the oversmoothing comparison could be stated more explicitly, with reference to the sparsity pattern of the attention matrix (a toy version of the contrast is sketched after this list).
  2. [Figures 2-4] Figure captions should indicate whether the plotted attention maps are from a specific layer or averaged across layers.
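
On the dense-versus-sparse point, a toy contrast makes the requested definition concrete. In the sketch below, a uniform row-stochastic matrix stands in for "dense" and a mostly-diagonal matrix with a small sink column stands in for "sparse"; both stand-ins and the single-step update X ← AX are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 32, 64
X = rng.normal(size=(T, d))         # token embeddings

def avg_cos_sim(X):
    """Average pairwise (off-diagonal) cosine similarity."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    return S[~np.eye(len(X), dtype=bool)].mean()

A_dense = np.full((T, T), 1.0 / T)  # uniform (dense) attention
A_sparse = 0.9 * np.eye(T)          # mostly self-attention...
A_sparse[:, 0] += 0.1               # ...plus a little mass on a sink

print("before      :", avg_cos_sim(X))
print("after dense :", avg_cos_sim(A_dense @ X))   # collapses to ~1.0
print("after sparse:", avg_cos_sim(A_sparse @ X))  # barely rises
```

One uniform step maps every token to the same mean vector, driving the similarity to 1, while the near-diagonal pattern leaves tokens close to where they started; the referee's request is for the paper to state which structural property of A (e.g., the number or spread of non-negligible entries per row) drives this gap.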

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight the need for stronger empirical grounding of our theoretical results on alignment conditions. We address each major comment below and will revise the manuscript accordingly to incorporate additional empirical analysis where feasible.

point-by-point responses
  1. Referee: [§3] Geometric conditions for sinks: The necessary alignment between the sink token embedding and every other embedding is required for both the equivalence proof and the subsequent cost comparison; the manuscript does not report any empirical measurement of this alignment in pretrained models, leaving the applicability of the main claims conditional on an unverified assumption.

    Authors: We agree that the alignment condition is foundational to the geometric analysis, equivalence proof, and cost comparison. The manuscript explicitly derives this as a necessary condition rather than assuming it holds universally. To address applicability, the revised manuscript will include new empirical measurements: we will compute average cosine similarities (and other alignment metrics) between sink token embeddings and other token embeddings across pretrained models (e.g., BERT, GPT-2) on standard benchmarks. This addition will quantify how often the condition is approximately satisfied in practice. revision: yes

  2. Referee: [§4] Equivalence to hard attention switch: The theorem establishing that sink attention yields an identically zero output holds only under the alignment condition of Assumption 2; without evidence that this geometry is realized in trained transformers, the claimed equivalence does not yet explain attention behavior in practice.

    Authors: The theorem is stated with the alignment assumption (Assumption 2), as required for the output to be identically zero. We will revise §4 to explicitly restate this precondition and add a discussion of its implications. Additionally, the new empirical alignment measurements (as noted in response to §3) will be cross-referenced here to assess the practical relevance of the equivalence. If alignment is only approximate, we will note that the hard-switch behavior may hold approximately, consistent with the paper's later relaxation to self-communication. revision: yes

  3. Referee: [§5] Cost comparison after relaxation: The quantitative argument favoring sinks over diagonal patterns inherits the same alignment precondition; if the alignment is weak or absent, the cost analysis does not account for why pretrained models prefer sinks.

    Authors: The cost comparison in the relaxed setting (allowing self-communication) is designed to close the gap between oversmoothing prevention and what sinks provide, but we acknowledge it builds on the geometric setup. In revision, we will clarify the dependence on alignment and integrate the new empirical results to evaluate whether the cost advantage persists under observed alignments in pretrained models. If alignment is weak, we will discuss alternative explanations (e.g., optimization dynamics) while preserving the quantitative comparison as a theoretical contribution. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivations are self-contained geometric proofs and empirical checks

full rationale

The paper's central results consist of explicit geometric conditions (necessary alignment of sink embeddings), formal equivalence proofs between sinks and hard attention switches (output identically zero), conditions for dense vs. sparse smoothing, and a quantitative cost comparison after relaxing to allow self-communication. These steps are presented as mathematical derivations and empirical verifications of stated assumptions rather than reductions to fitted parameters, self-definitional loops, or load-bearing self-citations. No quoted step renames a known result as a new prediction or imports uniqueness via prior author work as an unverified axiom. The alignment precondition is openly declared as necessary, so the equivalence and cost claims are conditional rather than smuggled, and the derivation chain does not feed its conclusions back into its premises.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard attention definitions and geometric embedding assumptions; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: Attention output can be made identically zero under certain embedding alignments.
    Invoked in the proof of the equivalence between sinks and the hard attention switch.

pith-pipeline@v0.9.0 · 5493 in / 1163 out tokens · 47317 ms · 2026-05-12T02:10:16.801555+00:00 · methodology

Reference graph

Works this paper leans on

  1. [1]

    Why do LLMs attend to the first token?

    Federico Barbero, Alvaro Arroyo, Xiangming Gu, Christos Perivolaropoulos, Petar Veličković, Razvan Pascanu, and Michael M. Bronstein. Why do LLMs attend to the first token? In Second Conference on Language Modeling, 2025

  2. [2]

    Transformers need glasses! Information over-squashing in language tasks

    Federico Barbero, Andrea Banino, Steven Kapturowski, Dharshan Kumaran, João Madeira Araújo, Oleksandr Vitvitskyi, Razvan Pascanu, and Petar Veličković. Transformers need glasses! Information over-squashing in language tasks. In Conference on Neural Information Processing Systems (NeurIPS), 2024

  3. [3]

    Recurrent neural networks for adaptive temporal processing

    Yoshua Bengio, Paolo Frasconi, and Marco Gori. Recurrent neural networks for adaptive temporal processing. In Proceedings of the 6th Italian Workshop on Parallel Architectures and Neural Networks WIRN93, pages 85–117, 1993

  4. [4]

    Quantizable transformers: Removing outliers by helping attention heads do nothing

    Yelysei Bondarenko, Markus Nagel, and Tijmen Blankevoort. Quantizable transformers: Removing outliers by helping attention heads do nothing. In Conference on Neural Information Processing Systems (NeurIPS), 2023

  5. [5]

    PyramidKV: Dynamic KV cache compression based on pyramidal information funneling

    Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, and Wen Xiao. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. In First Conference on Language Modeling, 2025

  6. [6]

    Vision transformers need registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In International Conference on Learning Representations (ICLR), 2024

  7. [7]

    Attention is not all you need: Pure attention loses rank doubly exponentially with depth

    Yihe Dong, Jean-Baptiste Cordonnier, and Andreas Loukas. Attention is not all you need: Pure attention loses rank doubly exponentially with depth. In International Conference on Machine Learning (ICML), 2021

  8. [8]

    Setting the record straight on transformer oversmoothing

    Gbetondji Jean-Sebastien Dovonon, Michael M. Bronstein, and Matt Kusner. Setting the record straight on transformer oversmoothing. Transactions on Machine Learning Research (TMLR), 2025

  9. [9]

    TinyStories: How small can language models be and still speak coherent English?

    Ronen Eldan and Yuanzhi Li. TinyStories: How small can language models be and still speak coherent English?, 2023

  10. [10]

    The emergence of clusters in self-attention dynamics

    Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. The emergence of clusters in self-attention dynamics. In Conference on Neural Information Processing Systems (NeurIPS), 2023

  11. [11]

    Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation

    Alessio Giorlandino and Sebastian Goldt. Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation. arXiv preprint arXiv:2505.24333, 2025

  12. [12]

    When attention sink emerges in language models: An empirical view

    Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, and Min Lin. When attention sink emerges in language models: An empirical view. In International Conference on Learning Representations (ICLR), 2025

  13. [13]

    Active-dormant attention heads: Mechanistically demystifying extreme-token phenomena in LLMs

    Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I. Jordan, and Song Mei. Active-dormant attention heads: Mechanistically demystifying extreme-token phenomena in LLMs. In The Second Conference on Parsimony and Learning (Recent Spotlight Track), 2025

  14. [14]

    Stochastic parroting in temporal attention – regulating the diagonal sink

    Victoria Hankemeier and Malte Schilling. Stochastic parroting in temporal attention – regulating the diagonal sink. arXiv preprint arXiv:2602.10956, 2026

  15. [15]

    CodeSearchNet challenge: Evaluating the state of semantic code search

    Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436, 2019

  16. [16]

    Motifs in attention patterns of large language models

    Michael Ivanitskiy, Cecilia Diniz Behn, and Samy Wu Fung. Motifs in attention patterns of large language models. In Mechanistic Interpretability Workshop at NeurIPS 2025, 2025

  17. [17]

    See what you are told: Visual attention sink in large multimodal models

    Seil Kang, Jinyeong Kim, Junhyeok Kim, and Seong Jae Hwang. See what you are told: Visual attention sink in large multimodal models. In International Conference on Learning Representations (ICLR), 2025

  18. [18]

    Clustering in causal attention masking

    Nikita Karagodin, Yury Polyanskiy, and Philippe Rigollet. Clustering in causal attention masking. In Conference on Neural Information Processing Systems (NeurIPS), 2024

  19. [19]

    char-rnn

    Andrej Karpathy. char-rnn. https://github.com/karpathy/char-rnn, 2015

  20. [20]

    Weight decay induces low-rank attention layers

    Seijin Kobayashi, Yassir Akram, and Johannes von Oswald. Weight decay induces low-rank attention layers. In Conference on Neural Information Processing Systems (NeurIPS), 2024

  21. [21]

    StreamingDialogue: Prolonged dialogue learning via long context compression with minimal losses

    Jia-Nan Li, Quan Tu, Cunli Mao, Zhengtao Yu, Ji-Rong Wen, and Rui Yan. StreamingDialogue: Prolonged dialogue learning via long context compression with minimal losses. In Conference on Neural Information Processing Systems (NeurIPS), 2024

  22. [22]

    To sink or not to sink: Visual information pathways in large vision-language models

    Jiayun Luo, Wan-Cyuan Fan, Lyuyang Wang, Xiangteng He, Tanzila Rahman, Purang Abolmaesumi, and Leonid Sigal. To sink or not to sink: Visual information pathways in large vision-language models. In International Conference on Learning Representations (ICLR), 2026

  23. [23]

    Pointer sentinel mixture models

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations (ICLR), 2017

  24. [24]

    Pointer sentinel mixture models

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations (ICLR), 2017

  25. [25]

    Signal propagation in transformers: Theoretical perspectives and the role of rank collapse

    Lorenzo Noci, Sotiris Anagnostidis, Luca Biggio, Antonio Orvieto, Sidak Pal Singh, and Aurelien Lucchi. Signal propagation in transformers: Theoretical perspectives and the role of rank collapse. In Conference on Neural Information Processing Systems (NeurIPS), 2022

  26. [26]

    A unified view of attention and residual sinks: Outlier-driven rescaling is essential for transformer training

    Zihan Qiu, Zeyu Huang, Kaiyue Wen, Peng Jin, Bo Zheng, Yuxin Zhou, Haofeng Huang, Zekun Wang, Xiao Li, Huaqing Zhang, et al. A unified view of attention and residual sinks: Outlier-driven rescaling is essential for transformer training. arXiv preprint arXiv:2601.22966, 2026

  27. [27]

    Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free

    Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. In Conference on Neural Information Processing Systems (NeurIPS), 2025

  28. [28]

    Attention sinks and compression valleys in LLMs are two sides of the same coin

    Enrique Queipo-de Llano, Álvaro Arroyo, Federico Barbero, Xiaowen Dong, Michael Bronstein, Yann LeCun, and Ravid Shwartz-Ziv. Attention sinks and compression valleys in LLMs are two sides of the same coin. arXiv preprint arXiv:2510.06477, 2025

  29. [29]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020

  30. [30]

    Attention sinks are provably necessary in softmax transformers: Evidence from trigger-conditional tasks

    Yuval Ran-Milo. Attention sinks are provably necessary in softmax transformers: Evidence from trigger-conditional tasks. arXiv preprint arXiv:2603.11487, 2026

  31. [31]

    Attention sinks in diffusion language models

    Maximo Eduardo Rulli, Simone Petruzzi, Edoardo Michielon, Fabrizio Silvestri, Simone Scardapane, and Alessio Devoto. Attention sinks in diffusion language models. arXiv preprint arXiv:2510.15731, 2025

  32. [32]

    What are you sinking? A geometric approach on attention sink

    Valeria Ruscio, Umberto Nanni, Fabrizio Silvestri, et al. What are you sinking? A geometric approach on attention sink. In Conference on Neural Information Processing Systems (NeurIPS), 2025

  33. [33]

    Mind the gap: a spectral analysis of rank collapse and signal propagation in attention layers

    Thiziri Nait Saada, Alireza Naderi, and Jared Tanner. Mind the gap: a spectral analysis of rank collapse and signal propagation in attention layers. arXiv preprint arXiv:2410.07799, 2024

  34. [34]

    Transformers, parallel computation, and logarithmic depth

    Clayton Sanford, Daniel Hsu, and Matus Telgarsky. Transformers, parallel computation, and logarithmic depth. In International Conference on Machine Learning (ICML), 2024

  35. [35]

    The graph neural network model

    Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008

  36. [36]

    Revisiting over-smoothing in BERT from the perspective of graph

    Han Shi, Jiahui Gao, Hang Xu, Xiaodan Liang, Zhenguo Li, Lingpeng Kong, Stephen M. S. Lee, and James Kwok. Revisiting over-smoothing in BERT from the perspective of graph. In International Conference on Learning Representations (ICLR), 2022

  37. [37]

    Attention reallocation: Towards zero-cost and controllable hallucination mitigation of MLLMs

    Chongjun Tu, Peng Ye, Dongzhan Zhou, Lei Bai, Gang Yu, Tao Chen, and Wanli Ouyang. Attention reallocation: Towards zero-cost and controllable hallucination mitigation of MLLMs. International Journal of Computer Vision, 134(1):22, 2026

  38. [38]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Conference on Neural Information Processing Systems (NeurIPS), 2017

  39. [39]

    Anti-oversmoothing in deep vision transformers via the Fourier domain analysis: From theory to practice

    Peihao Wang, Wenqing Zheng, Tianlong Chen, and Zhangyang Wang. Anti-oversmoothing in deep vision transformers via the Fourier domain analysis: From theory to practice. In International Conference on Learning Representations (ICLR), 2022

  40. [40]

    Mirage in the eyes: Hallucination attack on multi-modal large language models with only attention sink

    Yining Wang, Mi Zhang, Junjie Sun, Chenyue Wang, Min Yang, Hui Xue, Jialing Tao, Ranjie Duan, and Jiexi Liu. Mirage in the eyes: Hallucination attack on multi-modal large language models with only attention sink. In USENIX Security Symposium, 2025

  41. [41]

    Analysis of attention in video diffusion transformers

    Yuxin Wen, Jim Wu, Ajay Jain, Tom Goldstein, and Ashwinee Panda. Analysis of attention in video diffusion transformers. arXiv preprint arXiv:2504.10317, 2025

  42. [42]

    On the role of attention masks and layernorm in transformers

    Xinyi Wu, Amir Ajorlou, Yifei Wang, Stefanie Jegelka, and Ali Jadbabaie. On the role of attention masks and layernorm in transformers. In Conference on Neural Information Processing Systems (NeurIPS), 2024

  43. [43]

    Efficient streaming language models with attention sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In International Conference on Learning Representations (ICLR), 2024

  44. [44]

    Unveiling and harnessing hidden attention sinks: Enhancing large language models without training through attention calibration

    Zhongzhi Yu, Zheng Wang, Yonggan Fu, Huihong Shi, Khalid Shaikh, and Yingyan Celine Lin. Unveiling and harnessing hidden attention sinks: Enhancing large language models without training through attention calibration. In International Conference on Machine Learning (ICML), 2024

  45. [45]

    Exclusive self attention

    Shuangfei Zhai. Exclusive self attention. arXiv preprint arXiv:2603.09078, 2026

  46. [46]

    Attention sinks and outlier features: A "catch, tag, and release" mechanism for embeddings

    Stephen Zhang, Mustafa Khan, and Vardan Papyan. Attention sinks and outlier features: A "catch, tag, and release" mechanism for embeddings. In Conference on Neural Information Processing Systems (NeurIPS), 2025
