ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers
Pith reviewed 2026-05-08 06:30 UTC · model grok-4.3
The pith
ELSA reformulates online softmax attention as an exact prefix scan over an associative monoid to achieve linear memory and logarithmic parallel depth at full precision.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ELSA casts the online softmax update as a prefix scan over the associative monoid (m, S, W). In real arithmetic this reproduces exact softmax semantics; in FP32 it carries a provable O(u log n) relative error bound. The scan uses O(n) extra memory and O(log n) parallel depth, is independent of Tensor Core instructions, and is realized in Triton and CUDA C++ as a drop-in replacement that requires no weight changes or retraining.
What carries the argument
The associative monoid (m, S, W) that encodes the online softmax update rules so that prefix scanning computes the same result as the sequential algorithm.
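To make that machinery concrete, here is a minimal NumPy sketch of one plausible form of the monoid. The combine rule below is the standard online-softmax merge (cf. Milakov and Gimelshein [27]); the paper does not exhibit its exact definitions in the visible text, so the state layout (m = running max logit, S = exp-shifted normalizer, W = exp-shifted weighted value sum) is an assumption of this sketch, not the authors' code.

```python
import numpy as np

def combine(a, b):
    """Merge two partial softmax states; associative in real arithmetic."""
    m1, S1, W1 = a
    m2, S2, W2 = b
    m = max(m1, m2)                  # shared reference maximum
    c1 = np.exp(m1 - m)              # rescale the left partial sums
    c2 = np.exp(m2 - m)              # rescale the right partial sums
    return (m, c1 * S1 + c2 * S2, c1 * W1 + c2 * W2)

# One attention row: logits l_i = q . k_i against values v_i.
rng = np.random.default_rng(0)
n, d = 512, 64
logits = rng.standard_normal(n)
V = rng.standard_normal((n, d))

# Sequential left fold over per-token leaf states (l_i, 1, v_i).
state = (-np.inf, 0.0, np.zeros(d))          # monoid identity
for l, v in zip(logits, V):
    state = combine(state, (l, 1.0, v))
m, S, W = state

# The fold reproduces exact softmax attention for this row.
p = np.exp(logits - logits.max())
p /= p.sum()
print(np.max(np.abs(W / S - p @ V)))         # ~1e-16 in float64
```

Because combine is associative, the same leaves can be reduced by a balanced tree instead of this O(n) sequential loop; that tree is the prefix scan, and its depth is the claimed O(log n).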
If this is right
- Delivers 1.3–3.5× speedup over memory-efficient SDPA on A100 in FP32 for sequences of 1K–16K tokens.
- Achieves 1.97–2.27× speedup on BERT inference under the same conditions.
- Provides 1.5–1.6× throughput over standard math kernels on Jetson TX2 for sequences of 64–900 tokens.
- Yields 17.8–20.2% higher throughput for LLaMA-13B offloading at 32K tokens and beyond.
- Supplies a single hardware-agnostic kernel that supports both high-precision FP32 and FP16 paths.
Where Pith is reading between the lines
- The monoid scan pattern could be applied to other online normalization or aggregation steps inside transformers to reduce their sequential depth.
- Because the kernel needs no retraining, it could serve as an immediate drop-in for any existing FP32 checkpoint deployed on mixed or resource-limited hardware.
- Exact full-precision attention at long contexts may become practical on consumer GPUs that lack the latest Tensor Core instructions.
- The approach separates the algorithmic reformulation from any particular hardware, so the same monoid definition could be ported to new accelerators without re-deriving numerical properties.
Load-bearing premise
The monoid (m, S, W) is associative and the Triton or CUDA implementation reproduces the online softmax update exactly apart from the stated floating-point error bound.
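The premise is directly testable in a few lines: sample random partial states and compare the two groupings. A minimal probe, reusing the hypothetical combine rule from the sketch above (restated so this snippet runs standalone):

```python
import numpy as np

def combine(a, b):  # same merge rule as in the sketch above
    m1, S1, W1 = a
    m2, S2, W2 = b
    m = max(m1, m2)
    c1, c2 = np.exp(m1 - m), np.exp(m2 - m)
    return (m, c1 * S1 + c2 * S2, c1 * W1 + c2 * W2)

rng = np.random.default_rng(1)

def random_state(d=8):
    """Random plausible partial state: arbitrary max, positive mass."""
    return (5.0 * rng.standard_normal(),
            float(abs(rng.standard_normal())) + 1e-3,
            rng.standard_normal(d))

for _ in range(1000):
    a, b, c = random_state(), random_state(), random_state()
    left = combine(combine(a, b), c)    # (a . b) . c
    right = combine(a, combine(b, c))   # a . (b . c)
    assert np.isclose(left[0], right[0])
    assert np.isclose(left[1], right[1])
    assert np.allclose(left[2], right[2])
print("1000 random triples: both groupings agree to rounding error")
```

In real arithmetic the two groupings are identical; in floating point they differ only at rounding level, which is exactly the discrepancy the stated O(u log n) bound has to absorb.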
What would settle it
Execute both ELSA and a reference FP32 attention implementation on identical long input sequences and check whether their outputs differ by at most the O(u log n) relative error bound.
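A sketch of what that check could look like, with a balanced-tree fold of the monoid standing in for the actual ELSA kernel (which is not run here); the shapes, seeds, and the habit of comparing against u·log2(n) up to an unspecified constant are assumptions of this sketch, not the paper's protocol:

```python
import numpy as np

def combine(a, b):  # same merge rule as in the earlier sketches
    m1, S1, W1 = a
    m2, S2, W2 = b
    m = max(m1, m2)
    c1, c2 = np.exp(m1 - m), np.exp(m2 - m)
    return (m, c1 * S1 + c2 * S2, c1 * W1 + c2 * W2)

def scan_attention(logits, V):
    """Balanced-tree (log-depth) fold of (m, S, W); stays in the
    input dtype so FP32 rounding actually accumulates along the tree."""
    one = V.dtype.type(1.0)
    states = [(l, one, v) for l, v in zip(logits, V)]
    while len(states) > 1:
        merged = [combine(states[i], states[i + 1])
                  for i in range(0, len(states) - 1, 2)]
        if len(states) % 2:
            merged.append(states[-1])   # odd leaf carries over
        states = merged
    m, S, W = states[0]
    return W / S

rng = np.random.default_rng(2)
n, d = 4096, 64
logits = rng.standard_normal(n)
V = rng.standard_normal((n, d))

out32 = scan_attention(logits.astype(np.float32), V.astype(np.float32))
p = np.exp(logits - logits.max())
p /= p.sum()
out64 = p @ V                            # float64 reference

rel_err = np.linalg.norm(out32 - out64) / np.linalg.norm(out64)
u = np.finfo(np.float32).eps / 2         # FP32 unit roundoff, ~6e-8
print(f"relative error {rel_err:.2e}  vs  u*log2(n) = {u * np.log2(n):.2e}")
```

Repeating this across the reported sequence lengths (1K–16K) and checking that the measured error grows no faster than u log n would falsify or support the bound.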
Original abstract
Existing attention accelerators often trade exact softmax semantics, depend on fused Tensor Core kernels, or incur sequential depth that limits FP32 throughput on long sequences. We present \textbf{ELSA}, an algorithmic reformulation of online softmax attention that (i)~preserves exact softmax semantics in real arithmetic with a \emph{provable} $\mathcal{O}(u\log n)$ FP32 relative error bound; (ii)~casts the online softmax update as a prefix scan over an associative monoid $(m,S,W)$, yielding $O(n)$ extra memory and $O(\log n)$ parallel depth; and (iii)~is Tensor-Core independent, implemented in Triton and CUDA C++, and deployable as a \emph{drop-in replacement} requiring no retraining or weight modification. Unlike FlashAttention-2/3, which rely on HMMA/GMMA Tensor Core instructions and provide no compatible FP32 path, ELSA operates identically on A100s and resource-constrained edge devices such as Jetson TX2 -- making it the only hardware-agnostic exact-attention kernel that reduces parallel depth to $O(\log n)$ at full precision. On A100 FP32 benchmarks (1K--16K tokens), ELSA delivers $1.3$--$3.5\times$ speedup over memory-efficient SDPA and $1.97$--$2.27\times$ on BERT; on Jetson TX2, ELSA achieves $1.5$--$1.6\times$ over Math (64--900 tokens), with $17.8$--$20.2\%$ throughput gains under LLaMA-13B offloading at $\ge$32K. In FP16, ELSA approaches hardware-fused baselines at long sequences while retaining full FP32 capability, offering a unified kernel for high-precision inference across platforms. Our code and implementation are available at https://github.com/ming053l/ELSA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ELSA as an algorithmic reformulation of online softmax attention for vision transformers. It claims to preserve exact softmax semantics in real arithmetic via a prefix scan over an associative monoid (m, S, W), with a provable O(u log n) FP32 relative error bound, O(n) extra memory, and O(log n) parallel depth. The method is Tensor-Core independent, implemented in Triton and CUDA C++ as a drop-in replacement requiring no retraining, and reports speedups of 1.3–3.5× over memory-efficient SDPA on A100 FP32 (1K–16K tokens) and 1.5–1.6× on Jetson TX2, with additional gains for LLaMA-13B offloading at long sequences.
Significance. If the monoid associativity and implementation fidelity hold, ELSA would represent a meaningful advance as a hardware-agnostic exact-attention kernel that achieves logarithmic parallel depth at full precision without tensor-core reliance. The open-source code release at the provided GitHub link is a clear strength supporting reproducibility and falsifiability of the claimed error bound and speedups.
Major comments (2)
- Abstract: the central claim that the online softmax update is exactly equivalent to a prefix scan over the associative monoid (m,S,W) (apart from the stated FP32 bound) is load-bearing, yet the explicit definitions of the monoid operations and the proof of associativity are not exhibited, preventing verification that the Triton/CUDA implementation reproduces standard attention without unaccounted numerical or scheduling discrepancies.
- Experimental section: the reported speedups (e.g., 1.3–3.5× over SDPA on A100 FP32) lack detailed baseline kernel descriptions, full experimental setup (including whether FP32 paths were isolated), and empirical validation of the O(u log n) error bound against reference attention, undermining assessment of the practical claims.
Minor comments (1)
- Abstract: the title specifies vision transformers, yet benchmarks include BERT and LLaMA-13B; clarify the intended scope and whether the method generalizes beyond vision tasks.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment below and will revise the manuscript accordingly to improve verifiability and experimental rigor.
Point-by-point responses
- Referee: Abstract: the central claim that the online softmax update is exactly equivalent to a prefix scan over the associative monoid (m, S, W) (apart from the stated FP32 bound) is load-bearing, yet the explicit definitions of the monoid operations and the proof of associativity are not exhibited, preventing verification that the Triton/CUDA implementation reproduces standard attention without unaccounted numerical or scheduling discrepancies.
  Authors: We agree that explicit monoid definitions and the associativity proof are necessary for full verification. In the revised version we will add the formal definitions of the monoid operations (m, S, W) together with a self-contained proof of associativity, placed in the main text or a new appendix. We will also include pseudocode that directly maps the prefix-scan formulation to the Triton and CUDA kernels, clarifying that the implementation follows the mathematical specification without additional numerical transformations or scheduling artifacts beyond the stated FP32 error bound. Revision: yes.
- Referee: Experimental section: the reported speedups (e.g., 1.3–3.5× over SDPA on A100 FP32) lack detailed baseline kernel descriptions, full experimental setup (including whether FP32 paths were isolated), and empirical validation of the O(u log n) error bound against reference attention, undermining assessment of the practical claims.
  Authors: We accept that the experimental section requires additional detail. We will expand it to provide: (i) explicit descriptions of all baseline kernels (including the exact PyTorch SDPA configuration and whether the FP32 path was forced), (ii) complete experimental setup information (hardware, batch sizes, sequence lengths, and confirmation that all timing and accuracy runs used isolated FP32 arithmetic), and (iii) new empirical results comparing ELSA outputs against a reference full-attention implementation to validate the O(u log n) relative-error bound across the reported sequence lengths. These additions will be included in the revised manuscript and supplementary material. Revision: yes.
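For point (i), pinning down the baseline amounts to naming the SDPA backend and dtype explicitly. A minimal timing sketch using PyTorch's sdpa_kernel context manager (available since PyTorch 2.3); the shapes (batch 8, 16 heads, 4096 tokens, head dim 64) are illustrative, not the paper's configuration:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Pure-FP32 inputs; requires a CUDA device.
q = torch.randn(8, 16, 4096, 64, device="cuda", dtype=torch.float32)
k, v = torch.randn_like(q), torch.randn_like(q)

# Force the memory-efficient backend so no other kernel is selected.
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    F.scaled_dot_product_attention(q, k, v)     # warm-up
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    out = F.scaled_dot_product_attention(q, k, v)
    end.record()
    torch.cuda.synchronize()

print(f"memory-efficient SDPA, FP32: {start.elapsed_time(end):.2f} ms")
```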
Circularity Check
No circularity: ELSA is an independent algorithmic reformulation
Full rationale
The provided abstract and description present ELSA as a direct reformulation of online softmax attention into a prefix scan over the monoid (m, S, W), together with a claimed provable FP32 error bound and a hardware-agnostic implementation. No equations, parameters, or premises in the visible text reduce by construction to fitted inputs, self-citations, or prior results from the same authors. The associativity claim and error bound are asserted as mathematical properties of the reformulation rather than presupposed by the inputs; the derivation chain is therefore self-contained and can be checked against external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: the monoid operations (m, S, W) are associative.
- Standard math: floating-point relative error under the scan is bounded by O(u log n) (accumulation argument sketched below).
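The second axiom follows from the first by the standard rounding-error model (Higham [15]). A sketch of the accumulation argument, with c an unspecified constant absorbing the O(1) flops per combine:

```latex
% Standard model: fl(a \circ b) = (a \circ b)(1 + \delta), |\delta| \le u.
% One combine performs O(1) such operations, so it perturbs each state
% component by a factor of at most (1 + cu) for a modest constant c.
% A balanced scan over n leaves composes \lceil \log_2 n \rceil levels:
\[
  \left\lvert \frac{\widehat{O} - O}{O} \right\rvert
    \;\le\; (1 + cu)^{\lceil \log_2 n \rceil} - 1
    \;=\; \mathcal{O}(u \log n),
  \qquad u \log n \ll 1 .
\]
```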
Reference graph
Works this paper leans on
- [1] Guy E. Blelloch. Prefix sums and their applications. In Synthesis of Parallel Algorithms, pages 35–60, 1990.
- [2] Guy E. Blelloch. Prefix sums and their applications. Technical Report CMU-CS-90-190, Carnegie Mellon University, 1990.
- [3] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating Long Sequences with Sparse Transformers. In ICML, 2019.
- [4] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. Rethinking Attention with Performers. In ICLR, 2021.
- [5] Alexander Clark, Trevor Bean, et al. xformers: A modular and memory-efficient transformer library. In Open Source Software Track at NeurIPS, 2022. https://github.com/facebookresearch/xformers.
- [6] Franklin C. Crow. Summed-area tables for texture mapping. In ACM SIGGRAPH, pages 207–212, 1984.
- [7] Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
- [8] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In NeurIPS, 2022.
- [9] Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim Alabdulmohsin, Avital Oliver, Piotr Padlewski, Alexey Gritsenko, Mario Lučić, and Neil Houlsby. Patch n' Pack: NaViT, a vision transformer for any aspect ratio and resolution. In NeurIPS, 2023.
- [10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
- [11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR, 2021.
- [12] Qihang Fan, Quanzeng You, Xiaotian Han, Yongfei Liu, Yunzhe Tao, Huaibo Huang, Ran He, and Hongxia Yang. ViTAR: Vision transformer with any resolution, 2024.
- [13] Leo Feng, Frederick Tung, Mohamed Osama Ahmed, Yoshua Bengio, and Hossein Hajimirsadeghi. Attention as an RNN. arXiv preprint arXiv:2405.13956, 2024.
- [14]
- [15] Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, Philadelphia, PA, second edition, 2002.
- [16] W. Daniel Hillis and Guy L. Steele Jr. Data parallel algorithms. Communications of the ACM, 29(12):1170–1183, 1986.
- [17] Danfeng Hong, Zhu Han, Jing Yao, Lianru Gao, Bing Zhang, Antonio Plaza, and Jocelyn Chanussot. SpectralFormer: Rethinking hyperspectral image classification with transformers. TGRS, 2022.
- [18] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment Anything. In ICCV, 2023.
- [19] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The Efficient Transformer. In ICLR, 2020.
- [20] Peter M. Kogge and Harold S. Stone. A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE Transactions on Computers, C-22(8):786–793, 1973.
- [21] Vincent Leroy, Jerome Revaud, Thomas Lucas, and Philippe Weinzaepfel. Win-Win: Training high-resolution vision transformers from two windows. In ICLR, 2024.
- [22] Dacheng Li, Rulin Shao, Anze Xie, Eric P. Xing, Xuezhe Ma, Ion Stoica, Joseph E. Gonzalez, and Hao Zhang. DISTFLASHATTN: Distributed memory-efficient attention for long-context LLMs training. arXiv preprint arXiv:2310.03294, 2023.
- [23] Tsung-Yi Lin et al. Microsoft COCO: Common objects in context. In ECCV, 2014.
- [24] Xinyu Liu, Houwen Peng, Ningxin Zheng, Yuqing Yang, Han Hu, and Yixuan Yuan. EfficientViT: Memory efficient vision transformer with cascaded group attention. In CVPR, 2023.
- [25] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
- [26] Andrew L. Maas et al. Learning word vectors for sentiment analysis. In ACL, 2011.
- [27] Maxim Milakov and Natalia Gimelshein. Online normalizer calculation for softmax. arXiv preprint arXiv:1805.02867, 2018.
- [28] NVIDIA Corporation. CUB: CUDA unbound. https://github.com/NVIDIA/cccl, 2024.
- [29] Markus N. Rabe and Charles Staats. Self-attention does not need O(n²) memory. arXiv preprint arXiv:2112.05682, 2021.
- [30] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021.
- [31] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
- [32] Rya Sanovar, Srikant Bharadwaj, Renée St Amant, Victor Rühle, and Saravan Rajmohan. LeanAttention: Hardware-aware scalable attention mechanism for the decode-phase of transformers. In MLSys, 2024.
- [33] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. NeurIPS, 37:68658–68685, 2024.
- [34] You Shen, Zhipeng Zhang, Yansong Qu, and Liujuan Cao. FastVGGT: Training-free acceleration of visual geometry transformer. arXiv preprint arXiv:2509.02560, 2025.
- [35] Richard Socher et al. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013.
- [36] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- [37] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. In NeurIPS, 2017.
- [38] Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. In NeurIPS, 2024.
- [39] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In CVPR, 2025.
- [40] Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-Attention with Linear Complexity. arXiv preprint arXiv:2006.04768, 2020.
- [41] Yue Wang, Ming Wen, Hailiang Zhang, Jinyu Sun, Qiong Yang, Zhimin Zhang, and Hongmei Lu. HSIMAE: A unified masked autoencoder with large-scale pre-training for hyperspectral image classification. TGRS, 2024.
- [42] Chong Wu, Maolin Che, Renjie Xu, Zhuoheng Ran, and Hong Yan. ELFATT: Efficient linear fast attention for vision transformers. In ACM MM, 2025. arXiv:2501.06098.
- [43] Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nyströmformer: A Nyström-based approximation for self-attention. In AAAI, 2021.
- [44] Ting Yao, Yehao Li, Yingwei Pan, and Tao Mei. HIRI-ViT: Scaling vision transformer with high resolution inputs, 2024.
- [45] Seokju Yun and Youngmin Ro. SHViT: Single-head vision transformer with memory efficient macro design. In CVPR, 2024.
- [46] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontañón, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. Big Bird: Transformers for longer sequences. In NeurIPS, 2020.
- [47] Dong Zhang, Rui Yan, Pingcheng Dong, and Kwang-Ting Cheng. Memory efficient transformer adapter for dense predictions. In ICLR, 2025. arXiv:2502.01962.
- [48] Pengchuan Zhang, Xiyang Dai, Jianwei Yang, Bin Xiao, Lu Yuan, Lei Zhang, and Jianfeng Gao. Multi-scale vision longformer: A new vision transformer for high-resolution image encoding. In ICCV, pages 2998–3008, 2021.
- [49] Chuanyang Zheng. The linear attention resurrection in vision transformer. arXiv preprint arXiv:2501.16182, 2025.
- [50] Zilong Zhong, Ying Li, Lin Ma, Jonathan Li, and Wei-Shi Zheng. Spectral–spatial transformer network for hyperspectral image classification: A factorized architecture search framework. TGRS, 2021.