Recognition: no theorem link
Sink vs. diagonal patterns as mechanisms for attention switch and oversmoothing prevention
Pith reviewed 2026-05-12 02:10 UTC · model grok-4.3
The pith
Attention sinks are equivalent to hard switches that zero the attention output, and they cost less to represent than diagonal patterns.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sinks can be represented only under a necessary geometric alignment between the sink embedding and all other token embeddings. Under this alignment, sinks are equivalent to hard attention switches that set the attention output to identically zero. When self-communication is allowed, representing sinks costs less than representing diagonal patterns, which accounts for the prevalence of sinks in pretrained transformers and explains why attention layers sometimes act like MLPs.
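To make the switch reading concrete, the sketch below builds a toy attention head in which a sink key receives essentially all attention mass and carries a zero value vector, so the attention output vanishes and the residual stream passes through unchanged. This is a minimal numpy illustration, not the paper's construction: the margin kappa, the rank-1 routing to the sink, and the zero sink value are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 64, 8

# Token embeddings (roughly unit norm) and a designated sink.
Z = rng.normal(size=(T, d)) / np.sqrt(d)

# Assumed construction: every query scores the sink key kappa higher
# than any token key, making the sink a near-hard switch.
kappa = 10.0
logits = np.concatenate([np.full((T, 1), kappa), Z @ Z.T], axis=1)
attn = np.exp(logits - logits.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)

# Sink value vector set to zero: the attention output is then ~0,
# i.e., the layer sits in the "off" position of the switch.
V = np.vstack([np.zeros(d), Z])
out = attn @ V

print(attn[:, 0].min())                   # attention mass on the sink: ~1.0
print(np.linalg.norm(out, axis=1).max())  # output norm per token: ~0
```

With the attention output near zero, the token representations are updated only by the residual stream and the MLP, which is the sense in which the attention layer "acts like an MLP" when no token communication is needed.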
What carries the argument
The equivalence between sinks and a hard attention switch (output identically zero), together with the quantitative cost comparison against diagonal patterns.
Load-bearing premise
The sink embedding must be aligned with all other token embeddings for the geometric representation conditions and cost comparison to hold.
What would settle it
Finding a trained transformer in which a sink token is present and the alignment condition holds, yet the attention output is not identically zero; or one in which diagonal patterns are cheaper to represent than sinks.
Original abstract
This paper studies the role of sinks and diagonal patterns as attention switch and anti-oversmoothing mechanisms. We analyze geometric conditions under which sinks can be represented, showing a necessary alignment between the embedding of the sink and all other embeddings. Next, we refine the current understanding of the role of sinks in oversmoothing prevention: we specify the conditions under which dense attention provably smooths more than sparse attention, and empirically verify that such conditions are often satisfied in practice. We further prove an equivalence between sinks and hard attention switch, in which the output of the attention is identically 0. Finally, we relax the hard attention switch by allowing token self-communication: we provide a quantitative comparison of the costs of representing sinks vs. diagonal patterns, showing why sinks are favored in pretrained transformers. The introduction and analysis of diagonal patterns and the generalization of the attention switch close the gap between what oversmoothing prevention requires and what sinks provide, while also establishing when and why attention layers act like MLPs if token communication is not necessary.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes sinks and diagonal patterns in transformer attention as mechanisms for attention switching and oversmoothing prevention. It derives geometric conditions for sink representation (requiring alignment between the sink embedding and all other token embeddings), proves an equivalence between sinks and hard attention switches (where attention output is identically zero), specifies and empirically verifies conditions under which dense attention smooths more than sparse attention, and after relaxing the hard switch to allow self-communication, provides a quantitative cost comparison showing why sinks are favored over diagonal patterns in pretrained transformers. The work also discusses when attention layers behave like MLPs if inter-token communication is unnecessary.
Significance. If the geometric conditions and equivalences hold in practice, the paper offers a precise theoretical account of observed attention patterns, clarifies the gap between oversmoothing-prevention requirements and what sinks deliver, and explains the empirical preference for sinks via representational cost. The introduction of diagonal patterns as an alternative and the relaxation to self-communication are useful conceptual advances; the empirical checks on smoothing conditions add practical grounding.
major comments (3)
- [§3] §3 (geometric conditions for sinks): The necessary alignment between the sink token embedding and every other embedding is required for both the equivalence proof and the subsequent cost comparison; the manuscript does not report any empirical measurement of this alignment in pretrained models, leaving the applicability of the main claims conditional on an unverified assumption.
- [§4] §4 (equivalence to hard attention switch): The theorem establishing that sink attention yields output identically zero holds only under the alignment condition of Assumption 2; without evidence that this geometry is realized in trained transformers, the claimed equivalence does not yet explain attention behavior in practice.
- [§5] §5 (cost comparison after relaxation): The quantitative argument favoring sinks over diagonal patterns inherits the same alignment precondition; if the alignment is weak or absent, the cost analysis does not account for why pretrained models prefer sinks.
minor comments (2)
- [§3.2] The definition of 'dense' versus 'sparse' attention in the oversmoothing comparison could be stated more explicitly with reference to the attention matrix's sparsity pattern; a toy contrast along these lines is sketched after this list.
- [Figures 2-4] Figure captions should indicate whether the plotted attention maps are from a specific layer or averaged across layers.
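One way to make the dense/sparse contrast of §3.2 concrete: a toy iteration in which a dense row-stochastic attention matrix (every entry positive) mixes tokens toward their mean, while the diagonal pattern (A = I) leaves token dispersion untouched. The temperature 1/d and the dispersion metric below are our choices for illustration, not the paper's definitions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, layers = 32, 16, 30
Z0 = rng.normal(size=(T, d))

def spread(Z):
    # Oversmoothing proxy: mean distance of tokens from their average,
    # relative to the mean token norm.
    return (np.linalg.norm(Z - Z.mean(axis=0), axis=1).mean()
            / np.linalg.norm(Z, axis=1).mean())

def dense_attn(Z):
    # Dense pattern: every entry of the row-stochastic matrix is positive.
    S = Z @ Z.T / d  # mild temperature keeps the softmax dense
    A = np.exp(S - S.max(axis=1, keepdims=True))
    return A / A.sum(axis=1, keepdims=True)

Z_dense, Z_diag = Z0.copy(), Z0.copy()
for _ in range(layers):
    Z_dense = dense_attn(Z_dense) @ Z_dense  # averaging step mixes tokens
    Z_diag = np.eye(T) @ Z_diag              # diagonal pattern: no mixing

print(spread(Z0), spread(Z_dense), spread(Z_diag))
# dense: dispersion collapses toward 0; diagonal: dispersion unchanged
```

The paper's contribution is to specify when the dense case provably smooths more than the sparse one; the toy only shows the two extremes of that comparison.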
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight the need for stronger empirical grounding of our theoretical results on alignment conditions. We address each major comment below and will revise the manuscript accordingly to incorporate additional empirical analysis where feasible.
read point-by-point responses
- Referee: [§3] §3 (geometric conditions for sinks): The necessary alignment between the sink token embedding and every other embedding is required for both the equivalence proof and the subsequent cost comparison; the manuscript does not report any empirical measurement of this alignment in pretrained models, leaving the applicability of the main claims conditional on an unverified assumption.
Authors: We agree that the alignment condition is foundational to the geometric analysis, equivalence proof, and cost comparison. The manuscript explicitly derives this as a necessary condition rather than assuming it holds universally. To address applicability, the revised manuscript will include new empirical measurements: we will compute average cosine similarities (and other alignment metrics) between sink token embeddings and other token embeddings across pretrained models (e.g., BERT, GPT-2) on standard benchmarks. This addition will quantify how often the condition is approximately satisfied in practice. Revision: yes.
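A sketch of how such a measurement could look (our proxy, not the authors' protocol: we take GPT-2's static token-embedding table as a stand-in for the layer-input embeddings, read alignment as cosine similarity, and assume the sink role is played by the <|endoftext|> token, id 50256):

```python
import torch
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")
E = model.wte.weight.detach()      # (vocab_size, d) input embedding table
sink = E[50256]                    # <|endoftext|>, assumed sink token

# Cosine similarity between the sink embedding and every other embedding.
cos = torch.nn.functional.cosine_similarity(E, sink.unsqueeze(0), dim=1)
cos = cos[torch.arange(E.shape[0]) != 50256]  # drop self-similarity

print(f"mean={cos.mean():.3f}  min={cos.min():.3f}  "
      f"frac_positive={(cos > 0).float().mean():.3f}")
```

The paper's condition concerns embeddings as they arrive at the attention layer, so a faithful check would hook the hidden states at each layer rather than the embedding table; the snippet above is only the first pass the rebuttal seems to promise.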
- Referee: [§4] §4 (equivalence to hard attention switch): Theorem establishing that sink attention yields output identically zero holds only under the alignment condition of Assumption 2; without evidence that this geometry is realized in trained transformers, the claimed equivalence does not yet explain attention behavior in practice.
Authors: The theorem is stated with the alignment assumption (Assumption 2), as required for the output to be identically zero. We will revise §4 to explicitly restate this precondition and add a discussion of its implications. Additionally, the new empirical alignment measurements (as noted in response to §3) will be cross-referenced here to assess the practical relevance of the equivalence. If alignment is only approximate, we will note that the hard-switch behavior may hold approximately, consistent with the paper's later relaxation to self-communication. Revision: yes.
- Referee: [§5] §5 (cost comparison after relaxation): The quantitative argument favoring sinks over diagonal patterns inherits the same alignment precondition; if the alignment is weak or absent, the cost analysis does not account for why pretrained models prefer sinks.
Authors: The cost comparison in the relaxed setting (allowing self-communication) is designed to close the gap between oversmoothing prevention and what sinks provide, but we acknowledge it builds on the geometric setup. In revision, we will clarify the dependence on alignment and integrate the new empirical results to evaluate whether the cost advantage persists under observed alignments in pretrained models. If alignment is weak, we will discuss alternative explanations (e.g., optimization dynamics) while preserving the quantitative comparison as a theoretical contribution. Revision: yes.
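The flavor of the cost comparison can be reproduced in a toy setting (our construction, not the paper's bounds): to realize a sink pattern with score margin kappa, a rank-1 W_QK aligned with the sink direction suffices, while a diagonal pattern needs a scaled identity whose Frobenius norm picks up an extra sqrt(d) factor. Note how the sink cost below blows up as the alignments a_i approach zero, which is exactly the referee's point about the alignment precondition.

```python
import numpy as np

rng = np.random.default_rng(2)
d, T, kappa, alpha = 128, 64, 1.0, 0.5

# Unit-norm tokens with a shared component along the sink direction s,
# so the alignment condition z_i . s > 0 holds by construction.
s = rng.normal(size=d); s /= np.linalg.norm(s)
Z = alpha * s + rng.normal(size=(T, d)) / np.sqrt(d)
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
a = Z @ s                                  # alignments, all positive here

# Sink pattern via rank-1 W = c * s s^T:
# score(z_i, s) = c*a_i and score(z_i, z_j) = c*a_i*a_j, so a raw-score
# margin of kappa*sqrt(d) (softmax divides scores by sqrt(d)) needs
# c >= kappa*sqrt(d) / (a_min * (1 - a_max)). Cost: ||W||_F = c.
c_sink = kappa * np.sqrt(d) / (a.min() * (1 - a.max()))

# Diagonal pattern via W = c' * I:
# score(z_i, z_i) = c' and score(z_i, z_j) = c' * z_i.z_j, so
# c' >= kappa*sqrt(d) / (1 - max cross similarity). Cost: ||W||_F = c'*sqrt(d).
cross = (Z @ Z.T - np.eye(T)).max()
c_diag = kappa * np.sqrt(d) / (1 - cross)

print(f"sink cost ~ {c_sink:.0f}   diagonal cost ~ {c_diag * np.sqrt(d):.0f}")
# The diagonal pattern pays roughly an extra sqrt(d) factor, matching the
# qualitative conclusion that sinks are cheaper to represent.
```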
Circularity Check
No significant circularity; derivations are self-contained geometric proofs and empirical checks
Full rationale
The paper's central results consist of explicit geometric conditions (necessary alignment of sink embeddings), formal equivalence proofs between sinks and hard attention switches (output identically zero), conditions for dense vs. sparse smoothing, and a quantitative cost comparison after relaxing the switch to allow self-communication. These steps are presented as mathematical derivations and empirical verifications of stated assumptions rather than reductions to fitted parameters, self-definitional loops, or load-bearing self-citations. No quoted step renames a known result as a new prediction or imports uniqueness from prior author work as an unverified axiom. The alignment precondition is openly declared as necessary, so the equivalence and cost claims are conditional rather than smuggled. The derivation chain does not feed its own conclusions back in as premises.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Attention output can be made identically zero under certain embedding alignments.
Reference graph
Works this paper leans on
- [1]
- [2] Federico Barbero, Andrea Banino, Steven Kapturowski, Dharshan Kumaran, João Madeira Araújo, Oleksandr Vitvitskyi, Razvan Pascanu, and Petar Veličković. Transformers need glasses! Information over-squashing in language tasks. In Conference on Neural Information Processing Systems (NeurIPS), 2024.
- [3] Yoshua Bengio, Paolo Frasconi, and Marco Gori. Recurrent neural networks for adaptive temporal processing. In Proceedings of the 6th Italian Workshop on Parallel Architectures and Neural Networks (WIRN93), pages 85–117, 1993.
- [4] Yelysei Bondarenko, Markus Nagel, and Tijmen Blankevoort. Quantizable transformers: Removing outliers by helping attention heads do nothing. In Conference on Neural Information Processing Systems (NeurIPS), 2023.
- [5] Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, and Wen Xiao. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. In First Conference on Language Modeling, 2025.
- [6] Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In International Conference on Learning Representations (ICLR), 2024.
- [7] Yihe Dong, Jean-Baptiste Cordonnier, and Andreas Loukas. Attention is not all you need: Pure attention loses rank doubly exponentially with depth. In International Conference on Machine Learning (ICML), 2021.
- [8] Gbetondji Jean-Sebastien Dovonon, Michael M. Bronstein, and Matt Kusner. Setting the record straight on transformer oversmoothing. Transactions on Machine Learning Research (TMLR), 2025.
- [9] Ronen Eldan and Yuanzhi Li. TinyStories: How small can language models be and still speak coherent English?, 2023.
- [10] Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. The emergence of clusters in self-attention dynamics. In Conference on Neural Information Processing Systems (NeurIPS), 2023.
- [11] Alessio Giorlandino and Sebastian Goldt. Two failure modes of deep transformers and how to avoid them: A unified theory of signal propagation at initialisation. arXiv preprint arXiv:2505.24333, 2025.
- [12] Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, and Min Lin. When attention sink emerges in language models: An empirical view. In International Conference on Learning Representations (ICLR), 2025.
- [13] Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I. Jordan, and Song Mei. Active-dormant attention heads: Mechanistically demystifying extreme-token phenomena in LLMs. In The Second Conference on Parsimony and Learning (Recent Spotlight Track), 2025.
- [14] Victoria Hankemeier and Malte Schilling. Stochastic parroting in temporal attention: Regulating the diagonal sink. arXiv preprint arXiv:2602.10956, 2026.
- [15] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436, 2019.
- [16] Michael Ivanitskiy, Cecilia Diniz Behn, and Samy Wu Fung. Motifs in attention patterns of large language models. In Mechanistic Interpretability Workshop at NeurIPS 2025, 2025.
- [17] Seil Kang, Jinyeong Kim, Junhyeok Kim, and Seong Jae Hwang. See what you are told: Visual attention sink in large multimodal models. In International Conference on Learning Representations (ICLR), 2025.
- [18] Nikita Karagodin, Yury Polyanskiy, and Philippe Rigollet. Clustering in causal attention masking. In Conference on Neural Information Processing Systems (NeurIPS), 2024.
- [19] Andrej Karpathy. char-rnn. https://github.com/karpathy/char-rnn, 2015.
- [20] Seijin Kobayashi, Yassir Akram, and Johannes von Oswald. Weight decay induces low-rank attention layers. In Conference on Neural Information Processing Systems (NeurIPS), 2024.
- [21] Jia-Nan Li, Quan Tu, Cunli Mao, Zhengtao Yu, Ji-Rong Wen, and Rui Yan. StreamingDialogue: Prolonged dialogue learning via long context compression with minimal losses. In Conference on Neural Information Processing Systems (NeurIPS), 2024.
- [22] Jiayun Luo, Wan-Cyuan Fan, Lyuyang Wang, Xiangteng He, Tanzila Rahman, Purang Abolmaesumi, and Leonid Sigal. To sink or not to sink: Visual information pathways in large vision-language models. In International Conference on Learning Representations (ICLR), 2026.
- [23] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations (ICLR), 2017.
- [25] Lorenzo Noci, Sotiris Anagnostidis, Luca Biggio, Antonio Orvieto, Sidak Pal Singh, and Aurelien Lucchi. Signal propagation in transformers: Theoretical perspectives and the role of rank collapse. In Conference on Neural Information Processing Systems (NeurIPS), 2022.
- [26] Zihan Qiu, Zeyu Huang, Kaiyue Wen, Peng Jin, Bo Zheng, Yuxin Zhou, Haofeng Huang, Zekun Wang, Xiao Li, Huaqing Zhang, et al. A unified view of attention and residual sinks: Outlier-driven rescaling is essential for transformer training. arXiv preprint arXiv:2601.22966, 2026.
- [27] Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. In Conference on Neural Information Processing Systems (NeurIPS), 2025.
- [28] Enrique Queipo-de Llano, Álvaro Arroyo, Federico Barbero, Xiaowen Dong, Michael Bronstein, Yann LeCun, and Ravid Shwartz-Ziv. Attention sinks and compression valleys in LLMs are two sides of the same coin. arXiv preprint arXiv:2510.06477, 2025.
- [29] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
- [30] Yuval Ran-Milo. Attention sinks are provably necessary in softmax transformers: Evidence from trigger-conditional tasks. arXiv preprint arXiv:2603.11487, 2026.
- [31] Maximo Eduardo Rulli, Simone Petruzzi, Edoardo Michielon, Fabrizio Silvestri, Simone Scardapane, and Alessio Devoto. Attention sinks in diffusion language models. arXiv preprint arXiv:2510.15731, 2025.
- [32] Valeria Ruscio, Umberto Nanni, Fabrizio Silvestri, et al. What are you sinking? A geometric approach on attention sink. In Conference on Neural Information Processing Systems (NeurIPS), 2025.
- [33] Thiziri Nait Saada, Alireza Naderi, and Jared Tanner. Mind the gap: A spectral analysis of rank collapse and signal propagation in attention layers. arXiv preprint arXiv:2410.07799, 2024.
- [34] Clayton Sanford, Daniel Hsu, and Matus Telgarsky. Transformers, parallel computation, and logarithmic depth. In International Conference on Machine Learning (ICML), 2024.
- [35] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.
- [36] Han Shi, Jiahui Gao, Hang Xu, Xiaodan Liang, Zhenguo Li, Lingpeng Kong, Stephen M. S. Lee, and James Kwok. Revisiting over-smoothing in BERT from the perspective of graph. In International Conference on Learning Representations (ICLR), 2022.
- [37] Chongjun Tu, Peng Ye, Dongzhan Zhou, Lei Bai, Gang Yu, Tao Chen, and Wanli Ouyang. Attention reallocation: Towards zero-cost and controllable hallucination mitigation of MLLMs. International Journal of Computer Vision, 134(1):22, 2026.
- [38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Conference on Neural Information Processing Systems (NeurIPS), 2017.
- [39] Peihao Wang, Wenqing Zheng, Tianlong Chen, and Zhangyang Wang. Anti-oversmoothing in deep vision transformers via the Fourier domain analysis: From theory to practice. In International Conference on Learning Representations (ICLR), 2022.
- [40] Yining Wang, Mi Zhang, Junjie Sun, Chenyue Wang, Min Yang, Hui Xue, Jialing Tao, Ranjie Duan, and Jiexi Liu. Mirage in the eyes: Hallucination attack on multi-modal large language models with only attention sink. In USENIX Security Symposium, 2025.
- [41] Yuxin Wen, Jim Wu, Ajay Jain, Tom Goldstein, and Ashwinee Panda. Analysis of attention in video diffusion transformers. arXiv preprint arXiv:2504.10317, 2025.
- [42] Xinyi Wu, Amir Ajorlou, Yifei Wang, Stefanie Jegelka, and Ali Jadbabaie. On the role of attention masks and layernorm in transformers. In Conference on Neural Information Processing Systems (NeurIPS), 2024.
- [43] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In International Conference on Learning Representations (ICLR), 2024.
- [44] Zhongzhi Yu, Zheng Wang, Yonggan Fu, Huihong Shi, Khalid Shaikh, and Yingyan Celine Lin. Unveiling and harnessing hidden attention sinks: Enhancing large language models without training through attention calibration. In International Conference on Machine Learning (ICML), 2024.
- [45] Shuangfei Zhai. Exclusive self attention. arXiv preprint arXiv:2603.09078, 2026.
- [46] Stephen Zhang, Mustafa Khan, and Vardan Papyan. Attention sinks and outlier features: A "catch, tag, and release" mechanism for embeddings. In Conference on Neural Information Processing Systems (NeurIPS), 2025.