
arxiv: 2605.02144 · v1 · submitted 2026-05-04 · 💻 cs.LG

Projection-Free Transformers via Gaussian Kernel Attention

Archisman Ghosh, Debarshi Kundu, Swaroop Ghosh, Vasant Honavar

Pith reviewed 2026-05-08 19:25 UTC · model grok-4.3

classification 💻 cs.LG
keywords Gaussian Kernel Attention, projection-free attention, RBF kernel, kernel regression, efficient Transformers, language modeling, causal masking

The pith

Gaussian Kernel Attention replaces learned query-key projections in Transformers with a direct RBF kernel on per-head features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Gaussian Kernel Attention (GKA) as a simpler alternative to standard self-attention. Instead of computing Q and K via separate learned projections, GKA applies a Gaussian RBF kernel directly to the per-head token features and learns only one bandwidth parameter per head. This yields a model with 0.42× the parameters and 0.49× the training FLOPs of a standard-attention baseline, and it still trains stably with a near-zero train-validation gap on language modeling tasks. The design is presented as normalized kernel regression, connecting Transformer attention to classical non-local filtering methods. Experiments show competitive benchmark behavior at reduced compute, though bits-per-byte remains higher at the tested scale.

Core claim

Gaussian Kernel Attention computes token affinities directly using a Gaussian radial basis function kernel applied to per-head token features; each head learns only a bandwidth parameter σ_h, while a single shared output projection W_O maintains compatibility with the standard Transformer interface. This replaces softmax(QK^T/sqrt(d))V and can be viewed as normalized kernel regression over tokens. In autoregressive language modeling with causal masking implemented via kernel masking and renormalization, a depth-20 GKA model with 0.42× the parameters and 0.49× the training FLOPs of a standard-attention baseline trains stably, shows a near-zero train-validation gap, and reaches competitive results on standard benchmarks despite a higher bits-per-byte.
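The extract contains no code, so the following is a minimal PyTorch sketch of the mechanism as the claim above describes it. The module name, the log-parameterized bandwidth, and the tensor layout are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of Gaussian Kernel Attention (GKA): no learned Q/K/V
# projections, one bandwidth per head, one shared output projection W_O.
# Names and parameterization are illustrative, not the authors' code.
import torch
import torch.nn as nn


class GaussianKernelAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # One learnable bandwidth sigma_h per head (log-parameterized to stay positive).
        self.log_sigma = nn.Parameter(torch.zeros(n_heads))
        # Single output projection W_O, as in standard multi-head attention.
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); split channels into heads, no Q/K/V projections.
        b, t, _ = x.shape
        xh = x.view(b, t, self.n_heads, self.d_head).transpose(1, 2)   # (b, h, t, d_head)

        # Pairwise squared Euclidean distances between per-head token features.
        sq_dist = (xh.unsqueeze(3) - xh.unsqueeze(2)).pow(2).sum(-1)   # (b, h, t, t)
        sigma = self.log_sigma.exp().view(1, self.n_heads, 1, 1)
        kernel = torch.exp(-sq_dist / (2.0 * sigma ** 2))              # Gaussian RBF affinities

        # Normalized kernel regression: each output token is a kernel-weighted
        # average of the raw per-head token features.
        weights = kernel / kernel.sum(dim=-1, keepdim=True).clamp_min(1e-9)
        out = (weights @ xh).transpose(1, 2).reshape(b, t, -1)
        return self.w_o(out)
```

Causal masking and sliding windows would then amount to zeroing disallowed entries of `kernel` before the row normalization, as the paper describes; a sketch of that step appears under the simulated rebuttal below.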

What carries the argument

Gaussian RBF kernel applied to per-head token features to produce attention weights, with per-head bandwidth σ_h and one shared output projection W_O.
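Read literally, and assuming the weights are row-normalized over the (possibly masked) token set, the mechanism can be written as below. The kernel expression matches the form quoted in the referee report; the normalization and head concatenation are inferred from the "normalized kernel regression" and multi-head descriptions rather than quoted from the paper.

```latex
% Presumed form of GKA for head h; the normalization is an inference from the
% "normalized kernel regression" reading, not a quoted equation.
\begin{aligned}
  k_h(i,j) &= \exp\!\left(-\frac{\lVert x_i^h - x_j^h \rVert^2}{2\sigma_h^2}\right), &
  A^h_{ij} &= \frac{k_h(i,j)}{\sum_{j'} k_h(i,j')}, \\
  y_i^h &= \sum_j A^h_{ij}\, x_j^h, &
  y_i &= W_O\,[\,y_i^1;\dots;y_i^H\,].
\end{aligned}
```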

If this is right

  • GKA supplies an explicit locality scale through the per-head bandwidth parameter.
  • The reduced parameter count and FLOPs enable stable training with a near-zero train-validation gap.
  • Causal masking and sliding windows are realized simply by masking and renormalizing the kernel.
  • The mechanism links modern Transformers to classical kernel smoothing and non-local filtering.
  • It supplies one concrete dimension in the accuracy-efficiency trade-off space for attention design.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The explicit kernel form may make attention patterns easier to analyze or regularize than learned projections.
  • At substantially larger scales the absence of Q/K projections could limit the model's ability to learn highly task-specific similarities.
  • Hybrid architectures could apply GKA in early layers and standard attention in later layers to balance efficiency and expressivity.

Load-bearing premise

Raw per-head token features already contain enough information for a Gaussian kernel to generate useful attention weights without separate learned Q and K projections.

What would settle it

Training the GKA model at greater depth or on larger language corpora and observing that validation loss or downstream accuracy falls substantially behind a matched standard-attention baseline.

Figures

Figures reproduced from arXiv: 2605.02144 by Archisman Ghosh, Debarshi Kundu, Swaroop Ghosh, Vasant Honavar.

Figure 1. GKA-Transformer: projection-based self-attention is replaced with Gaussian Kernel Attention.
Figure 2. Gaussian Kernel Attention (simplified illustrative example).
read the original abstract

Self-attention in Transformers is typically implemented as $\mathrm{softmax}(QK^\top/\sqrt{d})V$, where $Q=XW_Q$, $K=XW_K$, and $V=XW_V$ are learned linear projections of the input $X$. We ask whether these learned projections are necessary, or whether they can be replaced by a simpler similarity-based diffusion operator. We introduce \textbf{Gaussian Kernel Attention} (GKA), a drop-in replacement for dot-product attention that computes token affinities directly using a Gaussian radial basis function (RBF) kernel applied to per-head token features. Each head learns only a bandwidth parameter $\sigma_h$, while a single output projection $W_O$ preserves compatibility with the standard Transformer interface. GKA can be interpreted as normalized kernel regression over tokens, linking modern Transformer architectures to classical non-local filtering and kernel smoothing methods. We evaluate GKA in both vision and language modeling settings. For autoregressive language modeling within the \texttt{nanochat} framework, we implement causal masking and sliding-window constraints by masking and renormalizing the Gaussian kernel. At depth 20, a GKA model with $0.42\times$ the parameters and $0.49\times$ the total training FLOPs of a standard attention baseline trains stably, exhibits a near-zero train-validation gap, and demonstrates competitive behavior on standard benchmarks, albeit with higher bits-per-byte (BPB) at this compute scale. Overall, GKA provides a minimal, interpretable attention mechanism with an explicit locality scale, offering a dimension in the accuracy-efficiency trade-off for Transformer design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Gaussian Kernel Attention (GKA) as a drop-in replacement for standard self-attention in Transformers. Instead of learned Q and K projections, GKA computes affinities using a Gaussian RBF kernel on per-head token features, learning only a bandwidth σ_h per head and retaining the output projection W_O. It interprets this as normalized kernel regression. In experiments on autoregressive language modeling with the nanochat framework at depth 20, the GKA model uses 0.42× parameters and 0.49× FLOPs of a baseline, trains stably with near-zero train-validation gap, and shows competitive performance on benchmarks, though with higher bits-per-byte (BPB). Similar evaluations are mentioned for vision settings.

Significance. If the central empirical claims hold, the work provides a parameter-efficient and interpretable alternative to dot-product attention, reducing resources while achieving stable training. The explicit locality scale via σ_h and the link to classical non-local filtering methods add conceptual value. The reported efficiency gains at depth 20 represent a concrete step toward more minimal Transformer architectures, though the higher BPB indicates a trade-off that merits further exploration. The demonstration of stable training with reduced FLOPs is a notable strength.

major comments (3)
  1. [Abstract / experimental evaluation] Abstract and experimental results: the claims of 'competitive behavior' and stable training with 0.42× parameters / 0.49× FLOPs rest on outcomes after fitting σ_h, but no error bars, full benchmark tables, or ablations on bandwidth learning (learned vs. fixed σ_h) are provided. This makes it difficult to quantify the BPB gap or assess robustness of the efficiency claims.
  2. [Method (Gaussian Kernel Attention)] Method section (GKA definition): the load-bearing assumption that raw per-head token features (obtained by splitting the input X without learned Q/K projections) already align Euclidean distance with task-relevant similarities for the Gaussian kernel exp(-||x_i^h - x_j^h||^2 / (2σ_h^2)) is not directly tested. An ablation restoring projections while retaining the kernel, or measuring adaptation of learned σ_h, is required to substantiate that the raw features suffice.
  3. [Language modeling experiments] Language modeling experiments: the higher BPB is noted without quantification relative to the baseline at matched compute, and details on exact per-head feature extraction (splitting, any normalization) and causal masking/renormalization implementation are insufficient to evaluate whether the near-zero train-validation gap generalizes beyond the tested nanochat depth-20 scale.
minor comments (2)
  1. Clarify the precise equations for implementing causal masking and sliding-window constraints by masking and renormalizing the Gaussian kernel matrix.
  2. Expand the vision-setting results with comparable metrics and tables to the language modeling experiments for better cross-domain assessment.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential of Gaussian Kernel Attention as a parameter-efficient alternative. We address each major comment below with targeted revisions to improve empirical robustness, substantiate key assumptions, and clarify implementation details.

read point-by-point responses
  1. Referee: [Abstract / experimental evaluation] Abstract and experimental results: the claims of 'competitive behavior' and stable training with 0.42× parameters / 0.49× FLOPs rest on outcomes after fitting σ_h, but no error bars, full benchmark tables, or ablations on bandwidth learning (learned vs. fixed σ_h) are provided. This makes it difficult to quantify the BPB gap or assess robustness of the efficiency claims.

    Authors: We agree that error bars, complete tables, and bandwidth ablations would strengthen the claims. In the revision we will report standard deviations from three independent runs for all metrics, include full benchmark tables comparing GKA against the baseline, and add an ablation of learned σ_h versus fixed values (e.g., σ_h = 1.0 and σ_h set to median pairwise distance). These additions will allow direct quantification of the BPB gap and robustness of the efficiency gains. revision: yes

  2. Referee: [Method (Gaussian Kernel Attention)] Method section (GKA definition): the load-bearing assumption that raw per-head token features (obtained by splitting the input X without learned Q/K projections) already align Euclidean distance with task-relevant similarities for the Gaussian kernel exp(-||x_i^h - x_j^h||^2 / (2σ_h^2)) is not directly tested. An ablation restoring projections while retaining the kernel, or measuring adaptation of learned σ_h, is required to substantiate that the raw features suffice.

    Authors: The referee correctly notes that this assumption is central and untested. We will add an ablation that restores learned Q and K projections, applies the Gaussian kernel to the projected features, and compares performance to the projection-free GKA. We will also report the distribution of learned σ_h values across heads and layers to demonstrate adaptation. This directly tests whether raw features suffice or whether projections remain beneficial. revision: yes

  3. Referee: [Language modeling experiments] Language modeling experiments: the higher BPB is noted without quantification relative to the baseline at matched compute, and details on exact per-head feature extraction (splitting, any normalization) and causal masking/renormalization implementation are insufficient to evaluate whether the near-zero train-validation gap generalizes beyond the tested nanochat depth-20 scale.

    Authors: We will expand the method section with explicit pseudocode for per-head splitting (simple channel partitioning after the input projection, no extra normalization), causal masking (upper-triangular mask on the kernel matrix followed by row-wise renormalization), and sliding-window constraints. For the BPB gap we will add a direct numerical comparison at the reported compute scale and discuss why a strictly matched-FLOP baseline would require new training runs; the primary efficiency claim remains the 0.42× parameter and 0.49× FLOP reduction at depth 20. These changes address the request for implementation transparency while acknowledging the scope of additional matched-compute experiments. revision: partial
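For concreteness, a minimal sketch of the masking-and-renormalization step described in this response follows: a causal (lower-triangular) constraint, an optional sliding window, and row-wise renormalization of the kernel matrix. The helper name and signature are illustrative, not the authors' pseudocode.

```python
# Sketch of causal and sliding-window constraints on a Gaussian kernel matrix,
# per the response above: zero out disallowed positions, then renormalize rows.
# The function name and signature are illustrative only.
from typing import Optional

import torch


def masked_kernel_weights(kernel: torch.Tensor, window: Optional[int] = None) -> torch.Tensor:
    """kernel: (..., seq_len, seq_len) nonnegative Gaussian affinities.
    Applies a causal mask (token i may attend to j <= i) and, if window is given,
    a sliding-window constraint (j >= i - window + 1), then renormalizes each row."""
    t = kernel.size(-1)
    i = torch.arange(t, device=kernel.device).unsqueeze(-1)   # query positions (column)
    j = torch.arange(t, device=kernel.device).unsqueeze(0)    # key positions (row)
    allowed = j <= i                                          # causal constraint
    if window is not None:
        allowed = allowed & (j >= i - window + 1)             # sliding-window band
    kernel = kernel.masked_fill(~allowed, 0.0)
    return kernel / kernel.sum(dim=-1, keepdim=True).clamp_min(1e-9)
```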

Circularity Check

0 steps flagged

No circularity: empirical proposal with observed training outcomes

full rationale

The paper defines GKA via the explicit RBF kernel formula on raw per-head features, introduces learnable σ_h and W_O, implements causal masking by renormalization, and reports post-training metrics (parameter count, FLOPs, BPB, stability) on nanochat and vision benchmarks. No step claims a first-principles derivation of performance quantities; the kernel-regression interpretation is presented as an analogy, not a reduction that forces the measured BPB or gap. All load-bearing claims rest on external training runs rather than self-referential fitting or self-citation chains.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that RBF kernel affinities on raw per-head features suffice for attention, plus the empirical observation that the resulting model trains stably at reduced scale.

free parameters (2)
  • bandwidth σ_h per head
    Learned scalar per attention head that sets the width of the Gaussian kernel and controls locality.
  • output projection W_O
    Single learned matrix retained to ensure dimensional compatibility with downstream layers.
axioms (1)
  • domain assumption: a Gaussian RBF kernel applied to token features can substitute for learned dot-product attention without major capacity loss
    Invoked when presenting GKA as a drop-in replacement and when interpreting it as normalized kernel regression.

pith-pipeline@v0.9.0 · 5593 in / 1436 out tokens · 51055 ms · 2026-05-08T19:25:06.494336+00:00 · methodology

