Recognition: no theorem link
Sink vs. diagonal patterns as mechanisms for attention switch and oversmoothing prevention
Pith reviewed 2026-05-12 02:10 UTC · model grok-4.3
The pith
Attention sinks are equivalent to hard switches that zero the attention output, and they cost less to represent than diagonal patterns.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sinks can be represented only under a necessary geometric alignment between the sink embedding and all other token embeddings. Under this alignment, sinks are equivalent to hard attention switches that set the attention output to identically zero. When self-communication is allowed, representing sinks costs less than representing diagonal patterns, which accounts for the prevalence of sinks in pretrained transformers and explains why attention layers sometimes act like MLPs.
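To make the switch reading concrete, the sketch below builds a toy attention head in which a sink key receives essentially all attention mass and carries a zero value vector, so the attention output vanishes and the residual stream passes through unchanged. This is a minimal numpy illustration, not the paper's construction: the margin kappa, the rank-1 routing to the sink, and the zero sink value are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 64, 8

# Token embeddings (roughly unit norm) and a designated sink.
Z = rng.normal(size=(T, d)) / np.sqrt(d)

# Assumed construction: every query scores the sink key kappa higher
# than any token key, making the sink a near-hard switch.
kappa = 10.0
logits = np.concatenate([np.full((T, 1), kappa), Z @ Z.T], axis=1)
attn = np.exp(logits - logits.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)

# Sink value vector set to zero: the attention output is then ~0,
# i.e., the layer sits in the "off" position of the switch.
V = np.vstack([np.zeros(d), Z])
out = attn @ V

print(attn[:, 0].min())                   # attention mass on the sink: ~1.0
print(np.linalg.norm(out, axis=1).max())  # output norm per token: ~0
```

With the attention output near zero, the token representations are updated only by the residual stream and the MLP, which is the sense in which the attention layer "acts like an MLP" when no token communication is needed.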
What carries the argument
The equivalence between sinks and a hard attention switch (output identically zero), together with the quantitative cost comparison against diagonal patterns.
Load-bearing premise
The sink embedding must be aligned with all other token embeddings for the geometric representation conditions and cost comparison to hold.
What would settle it
Finding a trained transformer in which a sink token is present and the alignment condition holds, yet the attention output is not identically zero; or one in which diagonal patterns are cheaper to represent than sinks.
Original abstract
This paper studies the role of sinks and diagonal patterns as attention switch and anti-oversmoothing mechanisms. We analyze geometric conditions under which sinks can be represented, showing a necessary alignment between the embedding of the sink and all other embeddings. Next, we refine the current understanding of the role of sinks in oversmoothing prevention: we specify the conditions under which dense attention provably smooths more than sparse attention, and empirically verify that such conditions are often satisfied in practice. We further prove an equivalence between sinks and hard attention switch, in which the output of the attention is identically 0. Finally, we relax the hard attention switch by allowing token self-communication: we provide a quantitative comparison of the costs of representing sinks vs. diagonal patterns, showing why sinks are favored in pretrained transformers. The introduction and analysis of diagonal patterns and the generalization of the attention switch close the gap between what oversmoothing prevention requires and what sinks provide, while also establishing when and why attention layers act like MLPs if token communication is not necessary.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes sinks and diagonal patterns in transformer attention as mechanisms for attention switching and oversmoothing prevention. It derives geometric conditions for sink representation (requiring alignment between the sink embedding and all other token embeddings), proves an equivalence between sinks and hard attention switches (where attention output is identically zero), specifies and empirically verifies conditions under which dense attention smooths more than sparse attention, and after relaxing the hard switch to allow self-communication, provides a quantitative cost comparison showing why sinks are favored over diagonal patterns in pretrained transformers. The work also discusses when attention layers behave like MLPs if inter-token communication is unnecessary.
Significance. If the geometric conditions and equivalences hold in practice, the paper offers a precise theoretical account of observed attention patterns, clarifies the gap between oversmoothing-prevention requirements and what sinks deliver, and explains the empirical preference for sinks via representational cost. The introduction of diagonal patterns as an alternative and the relaxation to self-communication are useful conceptual advances; the empirical checks on smoothing conditions add practical grounding.
major comments (3)
- [§3] §3 (geometric conditions for sinks): The necessary alignment between the sink token embedding and every other embedding is required for both the equivalence proof and the subsequent cost comparison; the manuscript does not report any empirical measurement of this alignment in pretrained models, leaving the applicability of the main claims conditional on an unverified assumption.
- [§4] §4 (equivalence to hard attention switch): The theorem establishing that sink attention yields output identically zero holds only under the alignment condition of Assumption 2; without evidence that this geometry is realized in trained transformers, the claimed equivalence does not yet explain attention behavior in practice.
- [§5] §5 (cost comparison after relaxation): The quantitative argument favoring sinks over diagonal patterns inherits the same alignment precondition; if the alignment is weak or absent, the cost analysis does not account for why pretrained models prefer sinks.
minor comments (2)
- [§3.2] The definition of 'dense' versus 'sparse' attention in the oversmoothing comparison could be stated more explicitly with reference to the attention matrix's sparsity pattern; a toy contrast along these lines is sketched after this list.
- [Figures 2-4] Figure captions should indicate whether the plotted attention maps are from a specific layer or averaged across layers.
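One way to make the dense/sparse contrast of §3.2 concrete: a toy iteration in which a dense row-stochastic attention matrix (every entry positive) mixes tokens toward their mean, while the diagonal pattern (A = I) leaves token dispersion untouched. The temperature 1/d and the dispersion metric below are our choices for illustration, not the paper's definitions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, layers = 32, 16, 30
Z0 = rng.normal(size=(T, d))

def spread(Z):
    # Oversmoothing proxy: mean distance of tokens from their average,
    # relative to the mean token norm.
    return (np.linalg.norm(Z - Z.mean(axis=0), axis=1).mean()
            / np.linalg.norm(Z, axis=1).mean())

def dense_attn(Z):
    # Dense pattern: every entry of the row-stochastic matrix is positive.
    S = Z @ Z.T / d  # mild temperature keeps the softmax dense
    A = np.exp(S - S.max(axis=1, keepdims=True))
    return A / A.sum(axis=1, keepdims=True)

Z_dense, Z_diag = Z0.copy(), Z0.copy()
for _ in range(layers):
    Z_dense = dense_attn(Z_dense) @ Z_dense  # averaging step mixes tokens
    Z_diag = np.eye(T) @ Z_diag              # diagonal pattern: no mixing

print(spread(Z0), spread(Z_dense), spread(Z_diag))
# dense: dispersion collapses toward 0; diagonal: dispersion unchanged
```

The paper's contribution is to specify when the dense case provably smooths more than the sparse one; the toy only shows the two extremes of that comparison.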
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight the need for stronger empirical grounding of our theoretical results on alignment conditions. We address each major comment below and will revise the manuscript accordingly to incorporate additional empirical analysis where feasible.
read point-by-point responses
- Referee: [§3] §3 (geometric conditions for sinks): The necessary alignment between the sink token embedding and every other embedding is required for both the equivalence proof and the subsequent cost comparison; the manuscript does not report any empirical measurement of this alignment in pretrained models, leaving the applicability of the main claims conditional on an unverified assumption.
Authors: We agree that the alignment condition is foundational to the geometric analysis, equivalence proof, and cost comparison. The manuscript explicitly derives this as a necessary condition rather than assuming it holds universally. To address applicability, the revised manuscript will include new empirical measurements: we will compute average cosine similarities (and other alignment metrics) between sink token embeddings and other token embeddings across pretrained models (e.g., BERT, GPT-2) on standard benchmarks. This addition will quantify how often the condition is approximately satisfied in practice. Revision: yes.
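A sketch of how such a measurement could look (our proxy, not the authors' protocol: we take GPT-2's static token-embedding table as a stand-in for the layer-input embeddings, read alignment as cosine similarity, and assume the sink role is played by the <|endoftext|> token, id 50256):

```python
import torch
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")
E = model.wte.weight.detach()      # (vocab_size, d) input embedding table
sink = E[50256]                    # <|endoftext|>, assumed sink token

# Cosine similarity between the sink embedding and every other embedding.
cos = torch.nn.functional.cosine_similarity(E, sink.unsqueeze(0), dim=1)
cos = cos[torch.arange(E.shape[0]) != 50256]  # drop self-similarity

print(f"mean={cos.mean():.3f}  min={cos.min():.3f}  "
      f"frac_positive={(cos > 0).float().mean():.3f}")
```

The paper's condition concerns embeddings as they arrive at the attention layer, so a faithful check would hook the hidden states at each layer rather than the embedding table; the snippet above is only the first pass the rebuttal seems to promise.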
- Referee: [§4] §4 (equivalence to hard attention switch): Theorem establishing that sink attention yields output identically zero holds only under the alignment condition of Assumption 2; without evidence that this geometry is realized in trained transformers, the claimed equivalence does not yet explain attention behavior in practice.
Authors: The theorem is stated with the alignment assumption (Assumption 2), as required for the output to be identically zero. We will revise §4 to explicitly restate this precondition and add a discussion of its implications. Additionally, the new empirical alignment measurements (as noted in response to §3) will be cross-referenced here to assess the practical relevance of the equivalence. If alignment is only approximate, we will note that the hard-switch behavior may hold approximately, consistent with the paper's later relaxation to self-communication. Revision: yes.
- Referee: [§5] §5 (cost comparison after relaxation): The quantitative argument favoring sinks over diagonal patterns inherits the same alignment precondition; if the alignment is weak or absent, the cost analysis does not account for why pretrained models prefer sinks.
Authors: The cost comparison in the relaxed setting (allowing self-communication) is designed to close the gap between oversmoothing prevention and what sinks provide, but we acknowledge it builds on the geometric setup. In revision, we will clarify the dependence on alignment and integrate the new empirical results to evaluate whether the cost advantage persists under observed alignments in pretrained models. If alignment is weak, we will discuss alternative explanations (e.g., optimization dynamics) while preserving the quantitative comparison as a theoretical contribution. Revision: yes.
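The flavor of the cost comparison can be reproduced in a toy setting (our construction, not the paper's bounds): to realize a sink pattern with score margin kappa, a rank-1 W_QK aligned with the sink direction suffices, while a diagonal pattern needs a scaled identity whose Frobenius norm picks up an extra sqrt(d) factor. Note how the sink cost below blows up as the alignments a_i approach zero, which is exactly the referee's point about the alignment precondition.

```python
import numpy as np

rng = np.random.default_rng(2)
d, T, kappa, alpha = 128, 64, 1.0, 0.5

# Unit-norm tokens with a shared component along the sink direction s,
# so the alignment condition z_i . s > 0 holds by construction.
s = rng.normal(size=d); s /= np.linalg.norm(s)
Z = alpha * s + rng.normal(size=(T, d)) / np.sqrt(d)
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
a = Z @ s                                  # alignments, all positive here

# Sink pattern via rank-1 W = c * s s^T:
# score(z_i, s) = c*a_i and score(z_i, z_j) = c*a_i*a_j, so a raw-score
# margin of kappa*sqrt(d) (softmax divides scores by sqrt(d)) needs
# c >= kappa*sqrt(d) / (a_min * (1 - a_max)). Cost: ||W||_F = c.
c_sink = kappa * np.sqrt(d) / (a.min() * (1 - a.max()))

# Diagonal pattern via W = c' * I:
# score(z_i, z_i) = c' and score(z_i, z_j) = c' * z_i.z_j, so
# c' >= kappa*sqrt(d) / (1 - max cross similarity). Cost: ||W||_F = c'*sqrt(d).
cross = (Z @ Z.T - np.eye(T)).max()
c_diag = kappa * np.sqrt(d) / (1 - cross)

print(f"sink cost ~ {c_sink:.0f}   diagonal cost ~ {c_diag * np.sqrt(d):.0f}")
# The diagonal pattern pays roughly an extra sqrt(d) factor, matching the
# qualitative conclusion that sinks are cheaper to represent.
```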
Circularity Check
No significant circularity; derivations are self-contained geometric proofs and empirical checks
Full rationale
The paper's central results consist of explicit geometric conditions (necessary alignment of sink embeddings), formal equivalence proofs between sinks and hard attention switches (output identically zero), conditions for dense vs. sparse smoothing, and a quantitative cost comparison after relaxing the switch to allow self-communication. These steps are presented as mathematical derivations and empirical verifications of stated assumptions rather than reductions to fitted parameters, self-definitional loops, or load-bearing self-citations. No quoted step renames a known result as a new prediction or imports uniqueness from prior author work as an unverified axiom. The alignment precondition is openly declared as necessary, so the equivalence and cost claims are conditional rather than smuggled. The derivation chain does not feed its own conclusions back in as premises.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Attention output can be made identically zero under certain embedding alignments.
Reference graph
Works this paper leans on
- [1]
- [2] Federico Barbero, Andrea Banino, Steven Kapturowski, Dharshan Kumaran, João Madeira Araújo, Oleksandr Vitvitskyi, Razvan Pascanu, and Petar Veličković. Transformers need glasses! Information over-squashing in language tasks. In Conference on Neural Information Processing Systems (NeurIPS), 2024.
- [3] Yoshua Bengio, Paolo Frasconi, and Marco Gori. Recurrent neural networks for adaptive temporal processing. In Proceedings of the 6th Italian Workshop on Parallel Architectures and Neural Networks (WIRN93), pages 85–117, 1993.
- [4] Yelysei Bondarenko, Markus Nagel, and Tijmen Blankevoort. Quantizable transformers: Removing outliers by helping attention heads do nothing. In Conference on Neural Information Processing Systems (NeurIPS), 2023.
- [5] Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, and Wen Xiao. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. In First Conference on Language Modeling, 2025.
- [6] Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In International Conference on Learning Representations (ICLR), 2024.
- [7] Yihe Dong, Jean-Baptiste Cordonnier, and Andreas Loukas. Attention is not all you need: Pure attention loses rank doubly exponentially with depth. In International Conference on Machine Learning (ICML), 2021.
- [8] Gbetondji Jean-Sebastien Dovonon, Michael M. Bronstein, and Matt Kusner. Setting the record straight on transformer oversmoothing. Transactions on Machine Learning Research (TMLR), 2025.
- [9] Ronen Eldan and Yuanzhi Li. TinyStories: How small can language models be and still speak coherent English?, 2023.
- [10] Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. The emergence of clusters in self-attention dynamics. In Conference on Neural Information Processing Systems (NeurIPS), 2023.
- [11] Alessio Giorlandino and Sebastian Goldt. Two failure modes of deep transformers and how to avoid them: A unified theory of signal propagation at initialisation. arXiv preprint arXiv:2505.24333, 2025.
- [12] Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, and Min Lin. When attention sink emerges in language models: An empirical view. In International Conference on Learning Representations (ICLR), 2025.
- [13] Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I. Jordan, and Song Mei. Active-dormant attention heads: Mechanistically demystifying extreme-token phenomena in LLMs. In The Second Conference on Parsimony and Learning (Recent Spotlight Track), 2025.
- [14] Victoria Hankemeier and Malte Schilling. Stochastic parroting in temporal attention: Regulating the diagonal sink. arXiv preprint arXiv:2602.10956, 2026.
- [15] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436, 2019.
- [16] Michael Ivanitskiy, Cecilia Diniz Behn, and Samy Wu Fung. Motifs in attention patterns of large language models. In Mechanistic Interpretability Workshop at NeurIPS 2025, 2025.
- [17] Seil Kang, Jinyeong Kim, Junhyeok Kim, and Seong Jae Hwang. See what you are told: Visual attention sink in large multimodal models. In International Conference on Learning Representations (ICLR), 2025.
- [18] Nikita Karagodin, Yury Polyanskiy, and Philippe Rigollet. Clustering in causal attention masking. In Conference on Neural Information Processing Systems (NeurIPS), 2024.
- [19] Andrej Karpathy. char-rnn. https://github.com/karpathy/char-rnn, 2015.
- [20] Seijin Kobayashi, Yassir Akram, and Johannes von Oswald. Weight decay induces low-rank attention layers. In Conference on Neural Information Processing Systems (NeurIPS), 2024.
- [21] Jia-Nan Li, Quan Tu, Cunli Mao, Zhengtao Yu, Ji-Rong Wen, and Rui Yan. StreamingDialogue: Prolonged dialogue learning via long context compression with minimal losses. In Conference on Neural Information Processing Systems (NeurIPS), 2024.
- [22] Jiayun Luo, Wan-Cyuan Fan, Lyuyang Wang, Xiangteng He, Tanzila Rahman, Purang Abolmaesumi, and Leonid Sigal. To sink or not to sink: Visual information pathways in large vision-language models. In International Conference on Learning Representations (ICLR), 2026.
- [23] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations (ICLR), 2017.
- [25] Lorenzo Noci, Sotiris Anagnostidis, Luca Biggio, Antonio Orvieto, Sidak Pal Singh, and Aurelien Lucchi. Signal propagation in transformers: Theoretical perspectives and the role of rank collapse. In Conference on Neural Information Processing Systems (NeurIPS), 2022.
- [26] Zihan Qiu, Zeyu Huang, Kaiyue Wen, Peng Jin, Bo Zheng, Yuxin Zhou, Haofeng Huang, Zekun Wang, Xiao Li, Huaqing Zhang, et al. A unified view of attention and residual sinks: Outlier-driven rescaling is essential for transformer training. arXiv preprint arXiv:2601.22966, 2026.
- [27] Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. In Conference on Neural Information Processing Systems (NeurIPS), 2025.
- [28] Enrique Queipo-de Llano, Álvaro Arroyo, Federico Barbero, Xiaowen Dong, Michael Bronstein, Yann LeCun, and Ravid Shwartz-Ziv. Attention sinks and compression valleys in LLMs are two sides of the same coin. arXiv preprint arXiv:2510.06477, 2025.
- [29] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
- [30] Yuval Ran-Milo. Attention sinks are provably necessary in softmax transformers: Evidence from trigger-conditional tasks. arXiv preprint arXiv:2603.11487, 2026.
- [31] Maximo Eduardo Rulli, Simone Petruzzi, Edoardo Michielon, Fabrizio Silvestri, Simone Scardapane, and Alessio Devoto. Attention sinks in diffusion language models. arXiv preprint arXiv:2510.15731, 2025.
- [32] Valeria Ruscio, Umberto Nanni, Fabrizio Silvestri, et al. What are you sinking? A geometric approach on attention sink. In Conference on Neural Information Processing Systems (NeurIPS), 2025.
- [33] Thiziri Nait Saada, Alireza Naderi, and Jared Tanner. Mind the gap: A spectral analysis of rank collapse and signal propagation in attention layers. arXiv preprint arXiv:2410.07799, 2024.
- [34] Clayton Sanford, Daniel Hsu, and Matus Telgarsky. Transformers, parallel computation, and logarithmic depth. In International Conference on Machine Learning (ICML), 2024.
- [35] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.
- [36] Han Shi, Jiahui Gao, Hang Xu, Xiaodan Liang, Zhenguo Li, Lingpeng Kong, Stephen M. S. Lee, and James Kwok. Revisiting over-smoothing in BERT from the perspective of graph. In International Conference on Learning Representations (ICLR), 2022.
- [37] Chongjun Tu, Peng Ye, Dongzhan Zhou, Lei Bai, Gang Yu, Tao Chen, and Wanli Ouyang. Attention reallocation: Towards zero-cost and controllable hallucination mitigation of MLLMs. International Journal of Computer Vision, 134(1):22, 2026.
- [38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Conference on Neural Information Processing Systems (NeurIPS), 2017.
- [39] Peihao Wang, Wenqing Zheng, Tianlong Chen, and Zhangyang Wang. Anti-oversmoothing in deep vision transformers via the Fourier domain analysis: From theory to practice. In International Conference on Learning Representations (ICLR), 2022.
- [40] Yining Wang, Mi Zhang, Junjie Sun, Chenyue Wang, Min Yang, Hui Xue, Jialing Tao, Ranjie Duan, and Jiexi Liu. Mirage in the eyes: Hallucination attack on multi-modal large language models with only attention sink. In USENIX Security Symposium, 2025.
- [41] Yuxin Wen, Jim Wu, Ajay Jain, Tom Goldstein, and Ashwinee Panda. Analysis of attention in video diffusion transformers. arXiv preprint arXiv:2504.10317, 2025.
- [42] Xinyi Wu, Amir Ajorlou, Yifei Wang, Stefanie Jegelka, and Ali Jadbabaie. On the role of attention masks and layernorm in transformers. In Conference on Neural Information Processing Systems (NeurIPS), 2024.
- [43] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In International Conference on Learning Representations (ICLR), 2024.
- [44] Zhongzhi Yu, Zheng Wang, Yonggan Fu, Huihong Shi, Khalid Shaikh, and Yingyan Celine Lin. Unveiling and harnessing hidden attention sinks: Enhancing large language models without training through attention calibration. In International Conference on Machine Learning (ICML), 2024.
- [45] Shuangfei Zhai. Exclusive self attention. arXiv preprint arXiv:2603.09078, 2026.
- [46] Stephen Zhang, Mustafa Khan, and Vardan Papyan. Attention sinks and outlier features: A "catch, tag, and release" mechanism for embeddings. In Conference on Neural Information Processing Systems (NeurIPS), 2025.