
arxiv: 2605.02144 · v1 · submitted 2026-05-04 · 💻 cs.LG

Projection-Free Transformers via Gaussian Kernel Attention

Archisman Ghosh, Debarshi Kundu, Swaroop Ghosh, Vasant Honavar

Pith reviewed 2026-05-08 19:25 UTC · model grok-4.3

classification 💻 cs.LG
keywords Gaussian Kernel Attention, projection-free attention, RBF kernel, kernel regression, efficient Transformers, language modeling, causal masking

The pith

Gaussian Kernel Attention replaces learned query-key projections in Transformers with a direct RBF kernel on per-head features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Gaussian Kernel Attention (GKA) as a simpler alternative to standard self-attention. Instead of computing Q and K via separate learned projections, GKA applies a Gaussian RBF kernel directly to the per-head token features and learns only one bandwidth parameter per head. This yields a model with 0.42× the parameters and 0.49× the training FLOPs of a standard-attention baseline, and it still trains stably with a near-zero train-validation gap on language modeling tasks. The design is presented as normalized kernel regression, connecting Transformer attention to classical non-local filtering methods. Experiments show competitive benchmark behavior at reduced compute, though bits-per-byte remains higher at the tested scale.

Core claim

Gaussian Kernel Attention computes token affinities directly using a Gaussian radial basis function kernel applied to per-head token features; each head learns only a bandwidth parameter σ_h, while a single shared output projection W_O maintains compatibility with the standard Transformer interface. This replaces softmax(QK^T/sqrt(d))V and can be viewed as normalized kernel regression over tokens. In autoregressive language modeling with causal masking implemented via kernel masking and renormalization, a depth-20 GKA model with 0.42× the parameters and 0.49× the training FLOPs of a standard-attention baseline trains stably, shows a near-zero train-validation gap, and reaches competitive results on standard benchmarks despite a higher bits-per-byte.
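The extract contains no code, so the following is a minimal PyTorch sketch of the mechanism as the claim above describes it. The module name, the log-parameterized bandwidth, and the tensor layout are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of Gaussian Kernel Attention (GKA): no learned Q/K/V
# projections, one bandwidth per head, one shared output projection W_O.
# Names and parameterization are illustrative, not the authors' code.
import torch
import torch.nn as nn


class GaussianKernelAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # One learnable bandwidth sigma_h per head (log-parameterized to stay positive).
        self.log_sigma = nn.Parameter(torch.zeros(n_heads))
        # Single output projection W_O, as in standard multi-head attention.
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); split channels into heads, no Q/K/V projections.
        b, t, _ = x.shape
        xh = x.view(b, t, self.n_heads, self.d_head).transpose(1, 2)   # (b, h, t, d_head)

        # Pairwise squared Euclidean distances between per-head token features.
        sq_dist = (xh.unsqueeze(3) - xh.unsqueeze(2)).pow(2).sum(-1)   # (b, h, t, t)
        sigma = self.log_sigma.exp().view(1, self.n_heads, 1, 1)
        kernel = torch.exp(-sq_dist / (2.0 * sigma ** 2))              # Gaussian RBF affinities

        # Normalized kernel regression: each output token is a kernel-weighted
        # average of the raw per-head token features.
        weights = kernel / kernel.sum(dim=-1, keepdim=True).clamp_min(1e-9)
        out = (weights @ xh).transpose(1, 2).reshape(b, t, -1)
        return self.w_o(out)
```

Causal masking and sliding windows would then amount to zeroing disallowed entries of `kernel` before the row normalization, as the paper describes; a sketch of that step appears under the simulated rebuttal below.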

What carries the argument

Gaussian RBF kernel applied to per-head token features to produce attention weights, with per-head bandwidth σ_h and one shared output projection W_O.
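Read literally, and assuming the weights are row-normalized over the (possibly masked) token set, the mechanism can be written as below. The kernel expression matches the form quoted in the referee report; the normalization and head concatenation are inferred from the "normalized kernel regression" and multi-head descriptions rather than quoted from the paper.

```latex
% Presumed form of GKA for head h; the normalization is an inference from the
% "normalized kernel regression" reading, not a quoted equation.
\begin{aligned}
  k_h(i,j) &= \exp\!\left(-\frac{\lVert x_i^h - x_j^h \rVert^2}{2\sigma_h^2}\right), &
  A^h_{ij} &= \frac{k_h(i,j)}{\sum_{j'} k_h(i,j')}, \\
  y_i^h &= \sum_j A^h_{ij}\, x_j^h, &
  y_i &= W_O\,[\,y_i^1;\dots;y_i^H\,].
\end{aligned}
```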

If this is right

  • GKA supplies an explicit locality scale through the per-head bandwidth parameter.
  • The reduced parameter count and FLOPs enable stable training with a near-zero train-validation gap.
  • Causal masking and sliding windows are realized simply by masking and renormalizing the kernel.
  • The mechanism links modern Transformers to classical kernel smoothing and non-local filtering.
  • It supplies one concrete dimension in the accuracy-efficiency trade-off space for attention design.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The explicit kernel form may make attention patterns easier to analyze or regularize than learned projections.
  • At substantially larger scales the absence of Q/K projections could limit the model's ability to learn highly task-specific similarities.
  • Hybrid architectures could apply GKA in early layers and standard attention in later layers to balance efficiency and expressivity.

Load-bearing premise

Raw per-head token features already contain enough information for a Gaussian kernel to generate useful attention weights without separate learned Q and K projections.

What would settle it

Training the GKA model at greater depth or on larger language corpora and observing that validation loss or downstream accuracy falls substantially behind a matched standard-attention baseline.

Figures

Figures reproduced from arXiv: 2605.02144 by Archisman Ghosh, Debarshi Kundu, Swaroop Ghosh, Vasant Honavar.

Figure 1. GKA-Transformer: projection-based self-attention is replaced with Gaussian Kernel Attention.
Figure 2. Gaussian Kernel Attention (simplified illustrative example).
read the original abstract

Self-attention in Transformers is typically implemented as $\mathrm{softmax}(QK^\top/\sqrt{d})V$, where $Q=XW_Q$, $K=XW_K$, and $V=XW_V$ are learned linear projections of the input $X$. We ask whether these learned projections are necessary, or whether they can be replaced by a simpler similarity-based diffusion operator. We introduce \textbf{Gaussian Kernel Attention} (GKA), a drop-in replacement for dot-product attention that computes token affinities directly using a Gaussian radial basis function (RBF) kernel applied to per-head token features. Each head learns only a bandwidth parameter $\sigma_h$, while a single output projection $W_O$ preserves compatibility with the standard Transformer interface. GKA can be interpreted as normalized kernel regression over tokens, linking modern Transformer architectures to classical non-local filtering and kernel smoothing methods. We evaluate GKA in both vision and language modeling settings. For autoregressive language modeling within the \texttt{nanochat} framework, we implement causal masking and sliding-window constraints by masking and renormalizing the Gaussian kernel. At depth 20, a GKA model with $0.42\times$ the parameters and $0.49\times$ the total training FLOPs of a standard attention baseline trains stably, exhibits a near-zero train-validation gap, and demonstrates competitive behavior on standard benchmarks, albeit with higher bits-per-byte (BPB) at this compute scale. Overall, GKA provides a minimal, interpretable attention mechanism with an explicit locality scale, offering a dimension in the accuracy-efficiency trade-off for Transformer design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Gaussian Kernel Attention (GKA) as a drop-in replacement for standard self-attention in Transformers. Instead of learned Q and K projections, GKA computes affinities using a Gaussian RBF kernel on per-head token features, learning only a bandwidth σ_h per head and retaining the output projection W_O. It interprets this as normalized kernel regression. In experiments on autoregressive language modeling with the nanochat framework at depth 20, the GKA model uses 0.42× parameters and 0.49× FLOPs of a baseline, trains stably with near-zero train-validation gap, and shows competitive performance on benchmarks, though with higher bits-per-byte (BPB). Similar evaluations are mentioned for vision settings.

Significance. If the central empirical claims hold, the work provides a parameter-efficient and interpretable alternative to dot-product attention, reducing resources while achieving stable training. The explicit locality scale via σ_h and the link to classical non-local filtering methods add conceptual value. The reported efficiency gains at depth 20 represent a concrete step toward more minimal Transformer architectures, though the higher BPB indicates a trade-off that merits further exploration. The demonstration of stable training with reduced FLOPs is a notable strength.

major comments (3)
  1. [Abstract / experimental evaluation] Abstract and experimental results: the claims of 'competitive behavior' and stable training with 0.42× parameters / 0.49× FLOPs rest on outcomes after fitting σ_h, but no error bars, full benchmark tables, or ablations on bandwidth learning (learned vs. fixed σ_h) are provided. This makes it difficult to quantify the BPB gap or assess robustness of the efficiency claims.
  2. [Method (Gaussian Kernel Attention)] Method section (GKA definition): the load-bearing assumption that raw per-head token features (obtained by splitting the input X without learned Q/K projections) already align Euclidean distance with task-relevant similarities for the Gaussian kernel exp(-||x_i^h - x_j^h||^2 / (2σ_h^2)) is not directly tested. An ablation restoring projections while retaining the kernel, or measuring adaptation of learned σ_h, is required to substantiate that the raw features suffice.
  3. [Language modeling experiments] Language modeling experiments: the higher BPB is noted without quantification relative to the baseline at matched compute, and details on exact per-head feature extraction (splitting, any normalization) and causal masking/renormalization implementation are insufficient to evaluate whether the near-zero train-validation gap generalizes beyond the tested nanochat depth-20 scale.
minor comments (2)
  1. Clarify the precise equations for implementing causal masking and sliding-window constraints by masking and renormalizing the Gaussian kernel matrix.
  2. Expand the vision-setting results with comparable metrics and tables to the language modeling experiments for better cross-domain assessment.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential of Gaussian Kernel Attention as a parameter-efficient alternative. We address each major comment below with targeted revisions to improve empirical robustness, substantiate key assumptions, and clarify implementation details.

read point-by-point responses
  1. Referee: [Abstract / experimental evaluation] Abstract and experimental results: the claims of 'competitive behavior' and stable training with 0.42× parameters / 0.49× FLOPs rest on outcomes after fitting σ_h, but no error bars, full benchmark tables, or ablations on bandwidth learning (learned vs. fixed σ_h) are provided. This makes it difficult to quantify the BPB gap or assess robustness of the efficiency claims.

    Authors: We agree that error bars, complete tables, and bandwidth ablations would strengthen the claims. In the revision we will report standard deviations from three independent runs for all metrics, include full benchmark tables comparing GKA against the baseline, and add an ablation of learned σ_h versus fixed values (e.g., σ_h = 1.0 and σ_h set to median pairwise distance). These additions will allow direct quantification of the BPB gap and robustness of the efficiency gains. revision: yes

  2. Referee: [Method (Gaussian Kernel Attention)] Method section (GKA definition): the load-bearing assumption that raw per-head token features (obtained by splitting the input X without learned Q/K projections) already align Euclidean distance with task-relevant similarities for the Gaussian kernel exp(-||x_i^h - x_j^h||^2 / (2σ_h^2)) is not directly tested. An ablation restoring projections while retaining the kernel, or measuring adaptation of learned σ_h, is required to substantiate that the raw features suffice.

    Authors: The referee correctly notes that this assumption is central and untested. We will add an ablation that restores learned Q and K projections, applies the Gaussian kernel to the projected features, and compares performance to the projection-free GKA. We will also report the distribution of learned σ_h values across heads and layers to demonstrate adaptation. This directly tests whether raw features suffice or whether projections remain beneficial. revision: yes

  3. Referee: [Language modeling experiments] Language modeling experiments: the higher BPB is noted without quantification relative to the baseline at matched compute, and details on exact per-head feature extraction (splitting, any normalization) and causal masking/renormalization implementation are insufficient to evaluate whether the near-zero train-validation gap generalizes beyond the tested nanochat depth-20 scale.

    Authors: We will expand the method section with explicit pseudocode for per-head splitting (simple channel partitioning after the input projection, no extra normalization), causal masking (upper-triangular mask on the kernel matrix followed by row-wise renormalization), and sliding-window constraints. For the BPB gap we will add a direct numerical comparison at the reported compute scale and discuss why a strictly matched-FLOP baseline would require new training runs; the primary efficiency claim remains the 0.42× parameter and 0.49× FLOP reduction at depth 20. These changes address the request for implementation transparency while acknowledging the scope of additional matched-compute experiments. revision: partial
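For concreteness, a minimal sketch of the masking-and-renormalization step described in this response follows: a causal (lower-triangular) constraint, an optional sliding window, and row-wise renormalization of the kernel matrix. The helper name and signature are illustrative, not the authors' pseudocode.

```python
# Sketch of causal and sliding-window constraints on a Gaussian kernel matrix,
# per the response above: zero out disallowed positions, then renormalize rows.
# The function name and signature are illustrative only.
from typing import Optional

import torch


def masked_kernel_weights(kernel: torch.Tensor, window: Optional[int] = None) -> torch.Tensor:
    """kernel: (..., seq_len, seq_len) nonnegative Gaussian affinities.
    Applies a causal mask (token i may attend to j <= i) and, if window is given,
    a sliding-window constraint (j >= i - window + 1), then renormalizes each row."""
    t = kernel.size(-1)
    i = torch.arange(t, device=kernel.device).unsqueeze(-1)   # query positions (column)
    j = torch.arange(t, device=kernel.device).unsqueeze(0)    # key positions (row)
    allowed = j <= i                                          # causal constraint
    if window is not None:
        allowed = allowed & (j >= i - window + 1)             # sliding-window band
    kernel = kernel.masked_fill(~allowed, 0.0)
    return kernel / kernel.sum(dim=-1, keepdim=True).clamp_min(1e-9)
```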

Circularity Check

0 steps flagged

No circularity: empirical proposal with observed training outcomes

full rationale

The paper defines GKA via the explicit RBF kernel formula on raw per-head features, introduces learnable σ_h and W_O, implements causal masking by renormalization, and reports post-training metrics (parameter count, FLOPs, BPB, stability) on nanochat and vision benchmarks. No step claims a first-principles derivation of performance quantities; the kernel-regression interpretation is presented as an analogy, not a reduction that forces the measured BPB or gap. All load-bearing claims rest on external training runs rather than self-referential fitting or self-citation chains.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that RBF kernel affinities on raw per-head features suffice for attention, plus the empirical observation that the resulting model trains stably at reduced scale.

free parameters (2)
  • bandwidth σ_h per head
    Learned scalar per attention head that sets the width of the Gaussian kernel and controls locality.
  • output projection W_O
    Single learned matrix retained to ensure dimensional compatibility with downstream layers.
axioms (1)
  • domain assumption: a Gaussian RBF kernel applied to token features can substitute for learned dot-product attention without major capacity loss
    Invoked when presenting GKA as a drop-in replacement and when interpreting it as normalized kernel regression.

pith-pipeline@v0.9.0 · 5593 in / 1436 out tokens · 51055 ms · 2026-05-08T19:25:06.494336+00:00 · methodology

