Projection-Free Transformers via Gaussian Kernel Attention
Pith reviewed 2026-05-08 19:25 UTC · model grok-4.3 · 3 Lean theorem links
The pith
Gaussian Kernel Attention replaces learned query-key projections in Transformers with a direct RBF kernel on per-head features.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Gaussian Kernel Attention computes token affinities directly using a Gaussian radial basis function kernel applied to per-head token features, with each head learning only a bandwidth parameter σ_h and a single output projection W_O to maintain compatibility. This replaces softmax(QK^T/sqrt(d))V and can be viewed as normalized kernel regression over tokens. In autoregressive language modeling with causal masking implemented via kernel renormalization, a depth-20 GKA model with 0.42× parameters and 0.49× training FLOPs trains stably, shows near-zero train-validation gap, and reaches competitive results on standard benchmarks despite higher bits-per-byte.
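The parameter reduction is easy to see at the attention-block level. The sketch below counts weight matrices under the description above; the dimensions (d_model = 1280, 10 heads) are illustrative stand-ins rather than the paper's reported configuration, and the whole-model 0.42× figure also reflects embeddings and MLP blocks, so this only shows why dropping the Q/K/V projections shrinks the attention block itself.

```python
def attn_params(d_model, n_heads, gka=False):
    """Per-layer attention weight count (biases ignored).

    Standard attention learns W_Q, W_K, W_V, W_O, each d_model x d_model.
    GKA, as described in the core claim, learns one output projection
    W_O plus a single bandwidth sigma_h per head.
    """
    if gka:
        return d_model * d_model + n_heads
    return 4 * d_model * d_model

# Illustrative dimensions (hypothetical, not the paper's reported config):
std_count = attn_params(1280, 10)            # 6,553,600 weights
gka_count = attn_params(1280, 10, gka=True)  # 1,638,410 weights
ratio = gka_count / std_count                # ~0.25 within the attention block
```

The attention block alone drops to roughly a quarter of its standard size; the reported 0.42× overall ratio is larger because the parameters outside attention are unchanged.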
What carries the argument
Gaussian RBF kernel applied to per-head token features to produce attention weights, with per-head bandwidth σ_h and one shared output projection W_O.
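As a concrete reading of this mechanism, the following NumPy sketch computes one GKA head from the Gaussian kernel and row normalization described above. It assumes, consistent with the single output projection in the paper's description, that the values being averaged are the raw per-head features themselves; W_O, applied after concatenating heads, is omitted.

```python
import numpy as np

def gka_head(x_h, sigma_h):
    """One Gaussian Kernel Attention head (illustrative sketch).

    x_h: (T, d_h) per-head token features, taken directly from the input
         with no learned Q/K projections; sigma_h: scalar bandwidth.
    Computes K_ij = exp(-||x_i - x_j||^2 / (2 sigma_h^2)), row-normalizes
    it so each row sums to 1, and returns the kernel-weighted average of
    the token features (normalized kernel regression over tokens).
    """
    # Pairwise squared Euclidean distances between token features.
    sq_dists = np.sum((x_h[:, None, :] - x_h[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2.0 * sigma_h ** 2))
    W = K / K.sum(axis=-1, keepdims=True)  # rows sum to 1
    return W @ x_h

# Toy usage: 4 tokens, 8-dim head features.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = gka_head(x, sigma_h=1.0)
```

One sanity property of the bandwidth: as sigma_h grows, the kernel flattens and every token's output approaches the global mean of the features, which is the sense in which sigma_h sets an explicit locality scale.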
If this is right
- GKA supplies an explicit locality scale through the per-head bandwidth parameter.
- The reduced parameter count and FLOPs enable stable training with a near-zero train-validation gap.
- Causal masking and sliding windows are realized simply by masking and renormalizing the kernel.
- The mechanism links modern Transformers to classical kernel smoothing and non-local filtering.
- It supplies one concrete dimension in the accuracy-efficiency trade-off space for attention design.
Where Pith is reading between the lines
- The explicit kernel form may make attention patterns easier to analyze or regularize than learned projections.
- At substantially larger scales the absence of Q/K projections could limit the model's ability to learn highly task-specific similarities.
- Hybrid architectures could apply GKA in early layers and standard attention in later layers to balance efficiency and expressivity.
Load-bearing premise
Raw per-head token features already contain enough information for a Gaussian kernel to generate useful attention weights without separate learned Q and K projections.
What would settle it
Training the GKA model at greater depth or on larger language corpora and observing that validation loss or downstream accuracy falls substantially behind a matched standard-attention baseline.
Original abstract
Self-attention in Transformers is typically implemented as $\mathrm{softmax}(QK^\top/\sqrt{d})V$, where $Q=XW_Q$, $K=XW_K$, and $V=XW_V$ are learned linear projections of the input $X$. We ask whether these learned projections are necessary, or whether they can be replaced by a simpler similarity-based diffusion operator. We introduce \textbf{Gaussian Kernel Attention} (GKA), a drop-in replacement for dot-product attention that computes token affinities directly using a Gaussian radial basis function (RBF) kernel applied to per-head token features. Each head learns only a bandwidth parameter $\sigma_h$, while a single output projection $W_O$ preserves compatibility with the standard Transformer interface. GKA can be interpreted as normalized kernel regression over tokens, linking modern Transformer architectures to classical non-local filtering and kernel smoothing methods. We evaluate GKA in both vision and language modeling settings. For autoregressive language modeling within the \texttt{nanochat} framework, we implement causal masking and sliding-window constraints by masking and renormalizing the Gaussian kernel. At depth 20, a GKA model with $0.42\times$ the parameters and $0.49\times$ the total training FLOPs of a standard attention baseline trains stably, exhibits a near-zero train-validation gap, and demonstrates competitive behavior on standard benchmarks, albeit with higher bits-per-byte (BPB) at this compute scale. Overall, GKA provides a minimal, interpretable attention mechanism with an explicit locality scale, offering a dimension in the accuracy-efficiency trade-off for Transformer design.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Gaussian Kernel Attention (GKA) as a drop-in replacement for standard self-attention in Transformers. Instead of learned Q and K projections, GKA computes affinities using a Gaussian RBF kernel on per-head token features, learning only a bandwidth σ_h per head and retaining the output projection W_O. It interprets this as normalized kernel regression. In experiments on autoregressive language modeling with the nanochat framework at depth 20, the GKA model uses 0.42× parameters and 0.49× FLOPs of a baseline, trains stably with near-zero train-validation gap, and shows competitive performance on benchmarks, though with higher bits-per-byte (BPB). Similar evaluations are mentioned for vision settings.
Significance. If the central empirical claims hold, the work provides a parameter-efficient and interpretable alternative to dot-product attention, reducing resources while achieving stable training. The explicit locality scale via σ_h and link to classical non-local filtering methods add conceptual value. The reported efficiency gains at depth 20 represent a concrete step toward more minimal Transformer architectures, though the higher BPB indicates a trade-off that merits further exploration. The demonstration of stable training with reduced FLOPs is a concrete strength.
major comments (3)
- [Abstract / experimental evaluation] The claims of 'competitive behavior' and stable training with 0.42× parameters / 0.49× FLOPs rest on outcomes after fitting σ_h, but no error bars, full benchmark tables, or ablations on bandwidth learning (learned vs. fixed σ_h) are provided. This makes it difficult to quantify the BPB gap or assess the robustness of the efficiency claims.
- [Method (Gaussian Kernel Attention)] The load-bearing assumption that raw per-head token features (obtained by splitting the input X without learned Q/K projections) already align Euclidean distance with task-relevant similarities for the Gaussian kernel exp(−||x_i^h − x_j^h||² / (2σ_h²)) is not directly tested. An ablation restoring projections while retaining the kernel, or measuring the adaptation of learned σ_h, is required to substantiate that the raw features suffice.
- [Language modeling experiments] The higher BPB is reported without quantification relative to the baseline at matched compute, and the details of per-head feature extraction (splitting, any normalization) and the causal masking/renormalization implementation are insufficient to evaluate whether the near-zero train-validation gap generalizes beyond the tested nanochat depth-20 scale.
minor comments (2)
- Clarify the precise equations for implementing causal masking and sliding-window constraints by masking and renormalizing the Gaussian kernel matrix.
- Expand the vision-setting results with comparable metrics and tables to the language modeling experiments for better cross-domain assessment.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for recognizing the potential of Gaussian Kernel Attention as a parameter-efficient alternative. We address each major comment below with targeted revisions to improve empirical robustness, substantiate key assumptions, and clarify implementation details.
Point-by-point responses
-
Referee: [Abstract / experimental evaluation] The claims of 'competitive behavior' and stable training with 0.42× parameters / 0.49× FLOPs rest on outcomes after fitting σ_h, but no error bars, full benchmark tables, or ablations on bandwidth learning (learned vs. fixed σ_h) are provided. This makes it difficult to quantify the BPB gap or assess the robustness of the efficiency claims.
Authors: We agree that error bars, complete tables, and bandwidth ablations would strengthen the claims. In the revision we will report standard deviations from three independent runs for all metrics, include full benchmark tables comparing GKA against the baseline, and add an ablation of learned σ_h versus fixed values (e.g., σ_h = 1.0 and σ_h set to median pairwise distance). These additions will allow direct quantification of the BPB gap and robustness of the efficiency gains. revision: yes
-
Referee: [Method (Gaussian Kernel Attention)] The load-bearing assumption that raw per-head token features (obtained by splitting the input X without learned Q/K projections) already align Euclidean distance with task-relevant similarities for the Gaussian kernel exp(−||x_i^h − x_j^h||² / (2σ_h²)) is not directly tested. An ablation restoring projections while retaining the kernel, or measuring the adaptation of learned σ_h, is required to substantiate that the raw features suffice.
Authors: The referee correctly notes that this assumption is central and untested. We will add an ablation that restores learned Q and K projections, applies the Gaussian kernel to the projected features, and compares performance to the projection-free GKA. We will also report the distribution of learned σ_h values across heads and layers to demonstrate adaptation. This directly tests whether raw features suffice or whether projections remain beneficial. revision: yes
-
Referee: [Language modeling experiments] The higher BPB is reported without quantification relative to the baseline at matched compute, and the details of per-head feature extraction (splitting, any normalization) and the causal masking/renormalization implementation are insufficient to evaluate whether the near-zero train-validation gap generalizes beyond the tested nanochat depth-20 scale.
Authors: We will expand the method section with explicit pseudocode for per-head splitting (simple channel partitioning after the input projection, no extra normalization), causal masking (upper-triangular mask on the kernel matrix followed by row-wise renormalization), and sliding-window constraints. For the BPB gap we will add a direct numerical comparison at the reported compute scale and discuss why a strictly matched-FLOP baseline would require new training runs; the primary efficiency claim remains the 0.42× parameter and 0.49× FLOP reduction at depth 20. These changes address the request for implementation transparency while acknowledging the scope of additional matched-compute experiments. revision: partial
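A minimal sketch of the masking-and-renormalization step the authors describe: zero out the kernel entries for future tokens (and, optionally, tokens outside a sliding window), then renormalize each row so the surviving entries sum to 1. The `window` parameter name is a hypothetical choice for illustration.

```python
import numpy as np

def masked_kernel_weights(K, window=None):
    """Causal (and optionally sliding-window) GKA weights.

    K: (T, T) Gaussian kernel matrix. Entries where token j lies in the
    future of token i (or, if `window` is given, more than `window - 1`
    positions in the past) are zeroed, and each row is renormalized so
    the remaining weights sum to 1.
    """
    T = K.shape[0]
    i, j = np.indices((T, T))
    mask = j <= i                    # causal: attend only to past/self
    if window is not None:
        mask &= (i - j) < window     # keep only the last `window` tokens
    K = np.where(mask, K, 0.0)
    return K / K.sum(axis=-1, keepdims=True)

# Uniform kernel for illustration: causal weights become 1/(i+1).
K = np.ones((4, 4))
W_causal = masked_kernel_weights(K)
W_swa = masked_kernel_weights(K, window=2)
```

With a uniform kernel, token 0 attends only to itself and token 3 attends uniformly to all four positions; with `window=2`, token 3 splits its weight between positions 2 and 3.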
Circularity Check
No circularity: empirical proposal with observed training outcomes
Full rationale
The paper defines GKA via the explicit RBF kernel formula on raw per-head features, introduces learnable σ_h and W_O, implements causal masking by renormalization, and reports post-training metrics (parameter count, FLOPs, BPB, stability) on nanochat and vision benchmarks. No step claims a first-principles derivation of performance quantities; the kernel-regression interpretation is presented as an analogy, not a reduction that forces the measured BPB or gap. All load-bearing claims rest on external training runs rather than self-referential fitting or self-citation chains.
Axiom & Free-Parameter Ledger
free parameters (2)
- bandwidth σ_h per head
- output projection W_O
axioms (1)
- domain assumption Gaussian RBF kernel applied to token features can substitute for learned dot-product attention without major capacity loss
Lean theorems connected to this paper
- IndisputableMonolith.Cost (Jcost = ½(x+x⁻¹)−1) · theorem washburn_uniqueness_aczel · relevance unclear · matched text: "K_ij^(h) = exp(−‖x_i^(h) − x_j^(h)‖² / (2σ_h²)), W^(h) = row_norm(K^(h)) ... Each head learns only a single scalar controlling locality."
- IndisputableMonolith.Foundation.AlphaCoordinateFixation (parameter-free α-pin) · theorem alpha_pin_under_high_calibration · relevance unclear · matched text: "Each head learns only a bandwidth parameter σ_h, while a single output projection W_O preserves compatibility with the standard Transformer interface."
Reference graph
Works this paper leans on
- [1] Abnar, S., Zuidema, W.: Quantifying attention flow in transformers. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL). pp. 4190–4197 (2020)
- [2] Buades, A., Coll, B., Morel, J.M.: A non-local algorithm for image denoising. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). pp. 60–65 (2005). https://doi.org/10.1109/CVPR.2005.38
- [3] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 9650–9660 (2021)
- [4] Chen, Y., Zeng, Q., Ji, H., Yang, Y.: Skyformer: Remodel self-attention with gaussian kernel and nyström method. In: Advances in Neural Information Processing Systems (NeurIPS). pp. 2122–2135 (2021). https://arxiv.org/abs/2111.00035
- [5] Choromanski, K.M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J.Q., Mohiuddin, A., Kaiser, L., Belanger, D., Colwell, L., Weller, A.: Rethinking attention with performers. In: International Conference on Learning Representations (ICLR) (2021). https://openreview.net/forum?id=Ua6zuk0WRH
- [6] Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.V.: Randaugment: Practical automated data augmentation with a reduced search space. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). pp. 702–703 (2020)
- [7] Dao, T., Fu, D.Y., Ermon, S., Rudra, A., Ré, C.: Flashattention: Fast and memory-efficient exact attention with IO-awareness. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 35, pp. 16344–16359 (2022). https://arxiv.org/abs/2205.14135
- [8] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 248–255 (2009)
- [9] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (ICLR) (2021). https://openreview.net/forum?id=YicbFdNTT...
- [10] Drineas, P., Mahoney, M.W., Cristianini, N.: On the nyström method for approximating a gram matrix for improved kernel-based learning. Journal of Machine Learning Research 6(12) (2005)
- [11] Hoffer, E., Ben-Nun, T., Hubara, I., Giladi, N., Hoefler, T., Soudry, D.: Augment your batch: Improving generalization through instance repetition. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8129–8138 (2020)
- [12] Huang, G., Sun, Y., Liu, Z., Sedra, D., Weinberger, K.Q.: Deep networks with stochastic depth. In: European Conference on Computer Vision (ECCV). pp. 646–
- [13] Karpathy, A.: nanochat. GitHub repository (2025). https://github.com/karpathy/nanochat, accessed 2026-02-28
- [14] Katharopoulos, A., Vyas, A., Pappas, N., Fleuret, F.: Transformers are RNNs: Fast autoregressive transformers with linear attention. In: Proceedings of the 37th International Conference on Machine Learning (ICML). Proceedings of Machine Learning Research, vol. 119, pp. 5156–5165. PMLR (2020). https://proceedings.mlr.press/v119/katharopoulos20a.html
- [15] Levesque, H.J., Davis, E., Morgenstern, L.: The winograd schema challenge. In: Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning (KR) (2012). https://cdn.aaai.org/ocs/4492/4492-21843-1-PB.pdf
- [16] Li, J., Fang, A., Smyrnis, G., Ivgi, M., Jordan, M., Gadre, S., Bansal, H., Guha, E., Keh, S., Arora, K., et al.: Datacomp-lm: In search of the next generation of training sets for language models. arXiv preprint arXiv:2406.11794 (2024). https://arxiv.org/abs/2406.11794
- [17] Li, M., Bi, W., Kwok, J.T., Lu, B.L.: Large-scale nyström kernel matrix approximation using randomized svd. IEEE Transactions on Neural Networks and Learning Systems 26(1), 152–164 (2014)
- [18] Liu, J., Yue, Y., Welling, M., Song, Y.: Krause synchronization transformers. arXiv preprint arXiv:2602.11534 (2026). https://arxiv.org/abs/2602.11534
- [19] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 10012–10022 (2021). https://openaccess.thecvf.com/content/ICCV2021/papers/Liu_Swin_Transformer_Hierarchical_Visi...
- [20] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (ICLR) (2019)
- [21] Luo, S., Li, S., Cai, T., He, D., Peng, D., Zheng, S., Ke, G., Wang, L., Liu, T.Y.: Stable, fast and accurate: Kernelized attention with relative positional encoding. In: Advances in Neural Information Processing Systems (NeurIPS). pp. 22795–22807 (2021). https://arxiv.org/abs/2106.12566
- [22] Mihaylov, T., Clark, P., Khot, T., Sabharwal, A.: Can a suit of armor conduct electricity? a new dataset for open book question answering. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2018). https://doi.org/10.18653/v1/D18-1260, https://aclanthology.org/D18-1260/
- [23] ML Foundations: dclm: Datacomp for language models. GitHub repository (2024). https://github.com/mlfoundations/dclm
- [24] Paperno, D., Kruszewski, G., Lazaridou, A., Pham, N.Q., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., Fernández, R.: The LAMBADA dataset: Word prediction requiring a broad discourse context. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL). pp. 1525–1534 (2016). https://doi.org/10.18653/v1/P16-1144, https://acla...
- [25] Penedo, G., Kydlíček, H., Ben Allal, L., Lozhkov, A., Mitchell, M., Raffel, C., von Werra, L., Wolf, T.: The fineweb datasets: Decanting the web for the finest text data at scale. arXiv preprint arXiv:2406.17557 (2024). https://arxiv.org/abs/2406.17557
- [26] Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Advances in Neural Information Processing Systems (NeurIPS) (2007). https://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines
- [27] Roemmele, M., Bejan, C.A., Gordon, A.S.: Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In: AAAI Spring Symposium Series: Logical Formalizations of Commonsense Reasoning (2011). https://cdn.aaai.org/ocs/2418/2418-10878-1-PB.pdf
- [28] Sakaguchi, K., Le Bras, R., Bhagavatula, C., Choi, Y.: WinoGrande: An adversarial winograd schema challenge at scale. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) (2020). https://arxiv.org/abs/1907.10641
- [29] Srinivasan, B.V., Hu, Q., Duraiswami, R., et al.: Gpuml: Graphical processors for speeding up kernel machines. In: Workshop on High Performance Analytics - Algorithms, Implementations, and Applications, SIAM Conference on Data Mining. vol. 31 (2010)
- [30] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2818–2826 (2016)
- [31] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: Proceedings of the 38th International Conference on Machine Learning (ICML). Proceedings of Machine Learning Research, vol. 139, pp. 10347–10357. PMLR (2021). https://proceedings.mlr.press/v139/touvron21a.html
- [32] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning (ICML). pp. 10347–10357. PMLR (2021)
- [33] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 30 (2017). https://arxiv.org/abs/1706.03762
- [34] Wang, S., Li, B.Z., Khabsa, M., Fang, H., Ma, H.: Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768 (2020). https://arxiv.org/abs/2006.04768
- [35] Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7794–7803 (2018). https://doi.org/10.1109/CVPR.2018.00813, https://openaccess.thecvf.com/content_cvpr_2018/papers/Wang_Non-Local_Neural_Networks_CVPR_2018_paper.pdf
- [36] Xiong, Y., Zeng, Z., Chakraborty, R., Tan, M., Fung, G., Li, Y., Singh, V.: Nyströmformer: A nyström-based algorithm for approximating self-attention. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). pp. 14138–14148 (2021). https://doi.org/10.1609/aaai.v35i16.17664, https://arxiv.org/abs/2102.03902
- [37] Yang, C., Duraiswami, R., Davis, L.S.: Efficient kernel machines using the improved fast gauss transform. In: Saul, L., Weiss, Y., Bottou, L. (eds.) Advances in Neural Information Processing Systems. vol. 17. MIT Press (2004). https://proceedings.neurips.cc/paper_files/paper/2004/file/85353d3b2f39b9c9b5ee3576578c04b7-Paper.pdf
- [38] Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., Yoo, Y.: Cutmix: Regularization strategy to train strong classifiers with localizable features. In: IEEE/CVF International Conference on Computer Vision (ICCV). pp. 6023–6032 (2019)
- [39] Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., Choi, Y.: HellaSwag: Can a machine really finish your sentence? In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL) (2019). https://doi.org/10.18653/v1/P19-1472, https://aclanthology.org/P19-1472/
- [40] Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. In: International Conference on Learning Representations (ICLR) (2018)
- [41] Zhong, Z., Zheng, L., Kang, G., Li, S., Yang, Y.: Random erasing data augmentation. Proceedings of the AAAI Conference on Artificial Intelligence 34(07), 13001–13008 (2020)