pith. machine review for the scientific record.

arxiv: 2605.12491 · v1 · submitted 2026-05-12 · 💻 cs.CV · cs.LG

Elastic Attention Cores for Scalable Vision Transformers

Alan Z. Song, Andrew F. Luo, Deva Ramanan, Hossein Adeli, Jiahang Cao, Michael J. Tarr, Mu Nan, Muquan Yu, Rui Zhang, Weijian Mai, Yinjie Chen

Pith reviewed 2026-05-13 05:57 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords vision transformers · scalable attention · core-periphery attention · linear complexity · elastic training · image classification · dense prediction · high-resolution vision

The pith

Vision transformers can learn rich visual representations without any direct patch-to-patch interactions by routing everything through a small set of learned core tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges the core assumption behind standard vision transformers that every patch must attend directly to every other patch. It instead introduces a core-periphery structure where a fixed, resolution-independent set of learned core tokens serves as the only communication channel for the image patches. Because patches interact solely with these cores rather than with one another, the attention cost becomes linear in the number of patches. The architecture keeps the full set of original patch tokens throughout the network and uses nested training along the core dimension to allow accuracy-compute trade-offs at inference time. If this holds, it removes the quadratic barrier that has limited vision transformers to lower-resolution inputs.

Core claim

VECA shows that effective visual-semantic representations emerge when patch tokens exchange information exclusively through a small, learned set of core embeddings that are initialized from scratch and updated across layers, producing linear O(N) complexity for fixed core count C while preserving competitive accuracy on both classification and dense prediction tasks.
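The O(N) claim can be sanity-checked by counting attention comparisons directly. A small illustration: the 2NC + C² count is the one stated in the paper's figures, while the specific N and C values below are chosen for illustration, not taken from the paper.

```python
def attention_comparisons(n_patches: int, n_cores: int) -> tuple[int, int]:
    """Compare pairwise-comparison counts: full self-attention (N^2)
    vs. the core-periphery pattern (patch->core, core->patch, core clique)."""
    full = n_patches ** 2
    core_periphery = 2 * n_patches * n_cores + n_cores ** 2
    return full, core_periphery

# A 1024x1024 image with 16x16 patches gives N = 4096 tokens.
full, cp = attention_comparisons(4096, 64)
print(full, cp)  # 16777216 vs 528384, roughly a 32x reduction
```

Doubling the resolution quadruples N, so the full count grows 16x while the dominant 2NC term of the core-periphery count only quadruples.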

What carries the argument

Visual Elastic Core Attention (VECA), in which a fixed set of C learned core tokens functions as the sole communication interface: every patch token attends only to the cores, and the cores attend to all patches, with the cores being propagated and updated layer by layer.
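A minimal NumPy sketch of this bipartite routing: single head, no learned projections, normalization, or MLPs, and the residual form and update order are our assumptions rather than the paper's specification.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def core_periphery_layer(patches, cores):
    """One round of bipartite routing: cores gather from all patches,
    then patches read back from the updated cores. No patch attends
    to any other patch directly."""
    d = patches.shape[-1]
    # cores attend to patches: (C, N) attention matrix
    cores = cores + softmax(cores @ patches.T / np.sqrt(d)) @ patches
    # patches attend to cores: (N, C) attention matrix
    patches = patches + softmax(patches @ cores.T / np.sqrt(d)) @ cores
    return patches, cores

rng = np.random.default_rng(0)
N, C, d = 16, 4, 8
patches, cores = core_periphery_layer(rng.normal(size=(N, d)),
                                      rng.normal(size=(C, d)))
print(patches.shape, cores.shape)  # (16, 8) (4, 8)
```

Both attention matrices are N-by-C or C-by-N, so the per-layer cost is O(NC) rather than O(N²).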

If this is right

  • Image resolution can increase without a quadratic explosion in attention cost, since only the linear term in N remains.
  • Nested training along the core axis lets a single model deliver a continuous range of accuracy-compute operating points at inference.
  • The full set of N patch tokens is retained rather than collapsed into a small bottleneck, preserving spatial information for dense tasks.
  • The same core-periphery pattern can be stacked or combined with other efficient attention variants without reintroducing quadratic scaling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach suggests that many sequence-modeling domains may benefit from replacing all-to-all attention with learned hub-and-spoke communication structures.
  • Because cores are learned from scratch rather than derived from the input, they may act as a form of implicit global memory that could transfer across tasks or datasets.
  • Varying C during training and testing offers a practical knob for deploying the same weights on devices with different speed-accuracy requirements.
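The last point can be sketched with a hypothetical nested core bank: under nested training along the core axis, any prefix of the cores is meant to act as a usable sub-model, so one set of weights serves several compute budgets. The routing simplifications and all names below are illustrative, not from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def forward_with_budget(patches, core_bank, budget):
    """Elastic-inference sketch: route through only the first `budget`
    cores of a nested-trained bank; cost scales with 2 * N * budget."""
    cores = core_bank[:budget]
    d = patches.shape[-1]
    cores = cores + softmax(cores @ patches.T / np.sqrt(d)) @ patches
    return patches + softmax(patches @ cores.T / np.sqrt(d)) @ cores

rng = np.random.default_rng(0)
x, bank = rng.normal(size=(64, 16)), rng.normal(size=(32, 16))
for budget in (8, 16, 32):  # same weights, three operating points
    print(budget, forward_with_budget(x, bank, budget).shape)
```

The same slicing works at deployment: a slow device uses a small prefix, a fast one uses the full bank, with no retraining.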

Load-bearing premise

A small fixed set of learned core tokens can capture and propagate all necessary visual-semantic information across patches without any direct patch-to-patch interactions.

What would settle it

Train VECA and a standard quadratic ViT on the same high-resolution dense-prediction benchmark with matched parameter count; if VECA accuracy remains within a few percent of the ViT while using far less compute, the claim is supported; a large gap would falsify it.

Figures

Figures reproduced from arXiv: 2605.12491 by Alan Z. Song, Andrew F. Luo, Deva Ramanan, Hossein Adeli, Jiahang Cao, Michael J. Tarr, Mu Nan, Muquan Yu, Rui Zhang, Weijian Mai, Yinjie Chen.

Figure 1. Outline of our core-periphery attention structure. (a) Self-attention uses a fully connected attention matrix with N² comparisons given N input patches. VECA constructs a core-periphery matrix with C core tokens that form a clique, requiring only 2NC + C² comparisons. (b) We visualize our dense features with UMAP. Our method produces stable embeddings across resolutions. (c) Top: We visualize our featu…
Figure 2. Architecture of VECA. (a) Our attention matrix is defined using a core-periphery structure with a graph diameter of 2. For N_p patch tokens and C active core tokens, the total connections are 2NC + C². For each layer, we transform the patches and cores and predict the spatial coordinate for each core. (b) By varying the number of core tokens, the core-patch attention granularity changes. With fewer cores, …
Figure 3. Learned dense representations. We compare the UMAP visualizations of dense features and [CLS]-dense cosine similarity maps under increasing input resolutions with those of representative ViT backbones. The learned representations of VECA demonstrate high-quality, clean dense features and remain robust at higher resolutions.
Figure 4. Visualization of semantic consistency. We visualize the cosine similarity between a selected patch token, marked by a red cross, and all other tokens to examine the semantic consistency of the learned dense visual representations. Brighter colors indicate patches with higher cosine similarity to the selected token. We observe that the representations focus on spatially coherent and semantically consistent …
Figure 5. Results of varying the core-token budget. (a) Dense prediction performance improves consistently as more core tokens are activated. We report linear-probe results on Context, Stuff, and NYUv2 at 768 resolution while sweeping the active budget from C = 8 to C = 64. (b) Qualitative patch-to-patch similarity visualizations across budgets. Larger core-token budgets produce more semantically coherent and spatially loc…
Figure 6. Complementary visual roles of core tokens. We visualize the patterns of different core tokens using the cosine similarity between selected core tokens and spatial patch features. Warmer colors indicate higher similarity. We find that different core tokens respond to distinct spatial regions, suggesting that the learned set of core tokens captures complementary visual information rather than duplicating the…
Figure 7. From isotropic to semantic clustering. We visualize the development of core token attention weights with UMAP during inference under different active token budgets. The results show that core tokens progressively cluster into semantically coherent regions, evolving from spherical and diffuse to structured representations.
Original abstract

Vision Transformers (ViTs) achieve strong data-driven scaling by leveraging all-to-all self-attention. However, this flexibility incurs a computational cost that scales quadratically with image resolution, limiting ViTs in high-resolution domains. Underlying this approach is the assumption that pairwise token interactions are necessary for learning rich visual-semantic representations. In this work, we challenge this assumption, demonstrating that effective visual representations can be learned without any direct patch-to-patch interaction. We propose VECA (Visual Elastic Core Attention), a vision transformer architecture that uses efficient linear-time core-periphery structured attention enabled by a small set of learned cores. In VECA, these cores act as a communication interface: patch tokens exchange information exclusively through the core tokens, which are initialized from scratch and propagated across layers. Because the $N$ image patches only directly interact with a resolution invariant set of $C$ learned "core" embeddings, this yields linear complexity $O(N)$ for predetermined $C$, which bypasses quadratic scaling. Compared to prior cross-attention architectures, VECA maintains and iteratively updates the full set of $N$ input tokens, avoiding a small $C$-way bottleneck. Combined with nested training along the core axis, our model can elastically trade off compute and accuracy during inference. Across classification and dense tasks, VECA achieves performance competitive with the latest vision foundation models while reducing computational cost. Our results establish elastic core-periphery attention as a scalable alternative building block for Vision Transformers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes VECA (Visual Elastic Core Attention), a Vision Transformer architecture that replaces all-to-all self-attention with a core-periphery structure: a small fixed set of C learned core tokens serves as the sole communication interface, so that the N image patches interact only via patch-to-core and core-to-patch attention. This yields O(NC) complexity per layer (linear in N for fixed C), maintains the full set of N tokens across layers, and supports nested training for elastic compute-accuracy trade-offs at inference. The central claim is that effective visual representations for both classification and dense prediction can be learned without any direct patch-to-patch interactions, achieving performance competitive with recent vision foundation models while avoiding quadratic scaling.

Significance. If the empirical claims hold, the work supplies a concrete, resolution-invariant building block that removes the quadratic bottleneck of standard ViT attention while preserving the full token set. The direct derivation of linear complexity from the bipartite routing rule and the nested-training mechanism for elastic inference are clear strengths. The result would be relevant to high-resolution vision and resource-constrained deployment, provided the core bottleneck demonstrably preserves the necessary spatial and semantic relations.

major comments (3)
  1. [§3] §3 (Architecture): the claim that iterative patch-to-core and core-to-patch attention recovers the expressivity of full self-attention for visual semantics lacks any capacity argument, derivation, or proof that arbitrary higher-order patch dependencies can be mediated exactly through the fixed C-dimensional bottleneck. The bipartite structure is stated to propagate information, but no analysis shows the composition of updates equals the function class of all-to-all attention.
  2. [Experiments] Experiments on dense tasks (e.g., segmentation or detection sections): local spatial structure must survive repeated compression through the C cores; the manuscript supplies no targeted ablations, attention-map visualizations, or controlled comparisons that isolate whether fine-grained locality is preserved or lost relative to standard self-attention baselines.
  3. [Experiments] Table/figure reporting quantitative results: the abstract asserts competitive performance on classification and dense tasks, yet the provided manuscript excerpt contains no numerical tables, baselines, error bars, or ablation details; the central empirical claim therefore rests on unverified assertions until these are supplied.
minor comments (2)
  1. [§3] Notation for core initialization and propagation across layers is described only in prose; an explicit update equation would clarify whether cores are re-initialized or carried forward.
  2. [Introduction] The term 'elastic' is used for both the core-periphery routing and the nested training schedule; a single sentence distinguishing the two usages would avoid ambiguity.
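For the first minor point, one explicit form the update could take, written as a hedged sketch: the symbols P^l for patch tokens, Z^l for core tokens, and the residual cross-attention form are our notation and assumptions, not the paper's.

```latex
% Cores Z^0 are learned parameters; thereafter carried forward, not re-initialized.
Z^{l+1} = Z^{l} + \operatorname{Attn}\bigl(Q = Z^{l},\; K = P^{l},\; V = P^{l}\bigr)
  \quad \text{(cores gather from patches)}
P^{l+1} = P^{l} + \operatorname{Attn}\bigl(Q = P^{l},\; K = Z^{l+1},\; V = Z^{l+1}\bigr)
  \quad \text{(patches read from updated cores)}
```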

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below, providing clarifications and indicating planned revisions to the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (Architecture): the claim that iterative patch-to-core and core-to-patch attention recovers the expressivity of full self-attention for visual semantics lacks any capacity argument, derivation, or proof that arbitrary higher-order patch dependencies can be mediated exactly through the fixed C-dimensional bottleneck. The bipartite structure is stated to propagate information, but no analysis shows the composition of updates equals the function class of all-to-all attention.

    Authors: We agree that the manuscript does not contain a formal capacity argument, derivation, or proof establishing that the bipartite core-periphery updates exactly recover the function class of all-to-all self-attention. The work is primarily empirical, demonstrating that competitive visual representations can be learned via the learned cores as an information bottleneck. We will revise §3 to include an explicit discussion of this limitation, clarifying that the architecture relies on iterative mediation through the cores rather than claiming theoretical equivalence, and we will note this as an avenue for future theoretical analysis. revision: partial

  2. Referee: [Experiments] Experiments on dense tasks (e.g., segmentation or detection sections): local spatial structure must survive repeated compression through the C cores; the manuscript supplies no targeted ablations, attention-map visualizations, or controlled comparisons that isolate whether fine-grained locality is preserved or lost relative to standard self-attention baselines.

    Authors: We recognize the need for explicit verification that fine-grained locality survives the core bottleneck. The full manuscript includes attention visualizations and ablations for dense tasks, but we will add new targeted ablations and controlled comparisons (e.g., locality metrics and direct baseline contrasts) in the revision to isolate preservation of spatial structure. revision: yes

  3. Referee: [Experiments] Table/figure reporting quantitative results: the abstract asserts competitive performance on classification and dense tasks, yet the provided manuscript excerpt contains no numerical tables, baselines, error bars, or ablation details; the central empirical claim therefore rests on unverified assertions until these are supplied.

    Authors: The referee's observation applies to the limited excerpt provided for review. The complete manuscript contains full tables reporting quantitative results on classification and dense tasks, with baselines, standard deviations, and ablation details. We will ensure these tables are prominently featured and clearly linked to all empirical claims in the revised version. revision: no

Circularity Check

0 steps flagged

No circularity: linear complexity follows directly from the architecture definition.

Full rationale

The paper defines VECA via core-periphery routing where each of N patches attends exclusively to a fixed set of C cores (and cores attend back to patches). The stated O(N) complexity is obtained by direct operation counting on this bipartite structure for predetermined C; it is not a fitted prediction or self-referential claim. No equations reduce any performance result to the inputs by construction, no uniqueness theorems are imported via self-citation, and no ansatz is smuggled in. The claim that cores suffice to propagate visual semantics is presented as an empirical hypothesis validated by experiments rather than a definitional necessity. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim depends on the unproven sufficiency of learned cores as a complete communication interface; no independent evidence for this sufficiency is supplied in the abstract.

free parameters (1)
  • C (number of cores)
    Predetermined resolution-invariant core count; acts as a hyperparameter controlling the compute-accuracy tradeoff.
axioms (1)
  • domain assumption A small fixed set of learned cores can serve as a sufficient communication bottleneck for all visual information exchange.
    Invoked to justify removal of direct patch-to-patch attention while preserving representation quality.
invented entities (1)
  • learned core embeddings no independent evidence
    purpose: Act as communication interface and information carriers across layers for the full set of patch tokens.
    Newly introduced tokens initialized from scratch; no external evidence of their sufficiency is provided.

pith-pipeline@v0.9.0 · 5601 in / 1212 out tokens · 37372 ms · 2026-05-13T05:57:16.855170+00:00 · methodology



Reference graph

Works this paper leans on

181 extracted references · 181 canonical work pages · 21 internal anchors

  1. [1]

    Mlp-mixer: An all-mlp architecture for vision.Advances in neural information processing systems, 34: 24261–24272, 2021

    Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. Mlp-mixer: An all-mlp architecture for vision.Advances in neural information processing systems, 34: 24261–24272, 2021

  2. [2]

    Resmlp: Feedforward networks for image classification with data-efficient training.IEEE transactions on pattern analysis and machine intelligence, 45(4):5314–5321, 2022

    Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, et al. Resmlp: Feedforward networks for image classification with data-efficient training.IEEE transactions on pattern analysis and machine intelligence, 45(4):5314–5321, 2022

  3. [3]

    Pay attention to mlps.Advances in neural information processing systems, 34:9204–9215, 2021

    Hanxiao Liu, Zihang Dai, David So, and Quoc V Le. Pay attention to mlps.Advances in neural information processing systems, 34:9204–9215, 2021

  4. [4]

    Fnet: Mixing tokens with fourier transforms

    James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, and Santiago Ontanon. Fnet: Mixing tokens with fourier transforms. InProceedings of the 2022 Conference of the north American chapter of the Association for Computational Linguistics: human language technologies, pages 4296–4313, 2022

  5. [5]

    Gfnet: Global filter networks for visual recognition.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9):10960–10973, 2023

    Yongming Rao, Wenliang Zhao, Zheng Zhu, Jie Zhou, and Jiwen Lu. Gfnet: Global filter networks for visual recognition.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9):10960–10973, 2023

  6. [6]

    Understanding the effective receptive field in deep convolutional neural networks.Advances in neural information processing systems, 29, 2016

    Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks.Advances in neural information processing systems, 29, 2016

  7. [7]

    How much position information do convolutional neural networks encode?arXiv preprint arXiv:2001.08248, 2020

    Md Amirul Islam, Sen Jia, and Neil DB Bruce. How much position information do convolutional neural networks encode?arXiv preprint arXiv:2001.08248, 2020

  8. [8]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020

  9. [9]

    Do vision transformers see like convolutional neural networks?Advances in neural information processing systems, 34:12116–12128, 2021

    Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do vision transformers see like convolutional neural networks?Advances in neural information processing systems, 34:12116–12128, 2021

  10. [10]

    How do vision transformers work?arXiv preprint arXiv:2202.06709, 2022

    Namuk Park and Songkuk Kim. How do vision transformers work?arXiv preprint arXiv:2202.06709, 2022

  11. [11]

    Rethinking vision transformers for mobilenet size and speed

    Yanyu Li, Ju Hu, Yang Wen, Georgios Evangelidis, Kamyar Salahi, Yanzhi Wang, Sergey Tulyakov, and Jian Ren. Rethinking vision transformers for mobilenet size and speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16889– 16900, 2023

  12. [12]

    Which transformer to favor: a comparative analysis of efficiency in vision transformers

    Tobias Christian Nauen, Sebastian Palacio, Federico Raue, and Andreas Dengel. Which transformer to favor: a comparative analysis of efficiency in vision transformers. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6955–6966. IEEE, 2025

  13. [13]

    Core-periphery structure in networks.SIAM Journal on Applied mathematics, 74(1):167–190, 2014

    M Puck Rombach, Mason A Porter, James H Fowler, and Peter J Mucha. Core-periphery structure in networks.SIAM Journal on Applied mathematics, 74(1):167–190, 2014

  14. [14]

    Identification of core-periphery structure in networks.Physical Review E, 91(3):032803, 2015

    Xiao Zhang, Travis Martin, and Mark EJ Newman. Identification of core-periphery structure in networks.Physical Review E, 91(3):032803, 2015

  15. [15]

    A method of matrix analysis of group structure.Psychome- trika, 14(2):95–116, 1949

    R Duncan Luce and Albert D Perry. A method of matrix analysis of group structure.Psychome- trika, 14(2):95–116, 1949

  16. [16]

    Learning ordered representations with nested dropout

    Oren Rippel, Michael Gelbart, and Ryan Adams. Learning ordered representations with nested dropout. InInternational Conference on Machine Learning, pages 1746–1754. PMLR, 2014. 11

  17. [17]

    Contrastive learning rivals masked image modeling in fine-tuning via feature distillation.arXiv preprint arXiv:2205.14141, 2022

    Yixuan Wei, Han Hu, Zhenda Xie, Zheng Zhang, Yue Cao, Jianmin Bao, Dong Chen, and Baining Guo. Contrastive learning rivals masked image modeling in fine-tuning via feature distillation.arXiv preprint arXiv:2205.14141, 2022

  18. [18]

    Am-radio: Agglomerative vision foundation model reduce all domains into one

    Mike Ranzinger, Greg Heinrich, Jan Kautz, and Pavlo Molchanov. Am-radio: Agglomerative vision foundation model reduce all domains into one. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12490–12500, 2024

  19. [19]

    Theia: Distilling diverse vision foundation models for robot learning.arXiv preprint arXiv:2407.20179, 2024

    Jinghuan Shang, Karl Schmeckpeper, Brandon B May, Maria Vittoria Minniti, Tarik Kelestemur, David Watkins, and Laura Herlant. Theia: Distilling diverse vision foundation models for robot learning.arXiv preprint arXiv:2407.20179, 2024

  20. [20]

    Unic: Universal classification models via multi-teacher distillation

    Mert Bülent Sarıyıldız, Philippe Weinzaepfel, Thomas Lucas, Diane Larlus, and Yannis Kalan- tidis. Unic: Universal classification models via multi-teacher distillation. InEuropean Conference on Computer Vision, pages 353–371. Springer, 2024

  21. [21]

    Oriane Siméoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025

  22. [22]

    Using anytime algorithms in intelligent systems.AI magazine, 17(3):73–73, 1996

    Shlomo Zilberstein. Using anytime algorithms in intelligent systems.AI magazine, 17(3):73–73, 1996

  23. [23]

    Matryoshka representation learning.Advances in Neural Information Processing Systems, 35:30233–30249, 2022

    Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ra- manujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, et al. Matryoshka representation learning.Advances in Neural Information Processing Systems, 35:30233–30249, 2022

  24. [24]

    Triangular dropout: variable network width without retraining.arXiv preprint arXiv:2205.01235, 2022

    Edward W Staley and Jared Markowitz. Triangular dropout: variable network width without retraining.arXiv preprint arXiv:2205.01235, 2022

  25. [25]

    Distributional Principal Autoencoders

    Xinwei Shen and Nicolai Meinshausen. Distributional principal autoencoders.arXiv preprint arXiv:2404.13649, 2024

  26. [26]

    Ordered embeddings and intrin- sic dimensionalities with information-ordered bottlenecks.Machine Learning: Science and Technology, 6(3):035010, 2025

    Matthew Ho, Xiaosheng Zhao, and Benjamin D Wandelt. Ordered embeddings and intrin- sic dimensionalities with information-ordered bottlenecks.Machine Learning: Science and Technology, 6(3):035010, 2025

  27. [27]

    A pca-like autoencoder.arXiv preprint arXiv:1904.01277, 2019

    Saïd Ladjal, Alasdair Newson, and Chi-Hieu Pham. A pca-like autoencoder.arXiv preprint arXiv:1904.01277, 2019

  28. [28]

    Runtime configurable deep neural networks for energy-accuracy trade-off

    Hokchhay Tann, Soheil Hashemi, R Iris Bahar, and Sherief Reda. Runtime configurable deep neural networks for energy-accuracy trade-off. InProceedings of the eleventh IEEE/acm/ifip International Conference on Hardware/Software Codesign and System Synthesis, pages 1–10, 2016

  29. [29]

    Nestednet: Learning nested sparse structures in deep neural networks

    Eunwoo Kim, Chanho Ahn, and Songhwai Oh. Nestednet: Learning nested sparse structures in deep neural networks. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8669–8678, 2018

  30. [30]

    Slimmable neural networks.arXiv preprint arXiv:1812.08928, 2018

    Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, and Thomas Huang. Slimmable neural networks.arXiv preprint arXiv:1812.08928, 2018

  31. [31]

    Once-for-all: Train one network and specialize it for efficient deployment,

    Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once-for-all: Train one network and specialize it for efficient deployment.arXiv preprint arXiv:1908.09791, 2019

  32. [32]

    Sortednet: A scalable and generalized framework for training modular deep neural networks.arXiv preprint arXiv:2309.00255, 2023

    Mojtaba Valipour, Mehdi Rezagholizadeh, Hossein Rajabzadeh, Parsa Kavehzadeh, Marzieh Tahaei, Boxing Chen, and Ali Ghodsi. Sortednet: A scalable and generalized framework for training modular deep neural networks.arXiv preprint arXiv:2309.00255, 2023

  33. [33]

    Subnetwork-to-go: Elastic neural network with dynamic training and customizable inference

    Kai Li and Yi Luo. Subnetwork-to-go: Elastic neural network with dynamic training and customizable inference. InICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6775–6779. IEEE, 2024. 12

  34. [34]

    Slicing vision transformer for flexible inference.Advances in Neural Information Processing Systems, 37:42649–42671, 2024

    Yitian Zhang, Huseyin Coskun, Xu Ma, Huan Wang, Ke Ma, Xi Chen, Derek H Hu, and Yun Fu. Slicing vision transformer for flexible inference.Advances in Neural Information Processing Systems, 37:42649–42671, 2024

  35. [35]

    Matformer: Nested transformer for elastic inference.Advances in Neural Information Processing Systems, 37: 140535–140564, 2024

    Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, and Prateek Jain. Matformer: Nested transformer for elastic inference.Advances in Neural Information Processing Systems, 37: 140535–140564, 2024

  36. [36]

    Flextron: Many-in-one flexible large language model.arXiv preprint arXiv:2406.10260, 2024

    Ruisi Cai, Saurav Muralidharan, Greg Heinrich, Hongxu Yin, Zhangyang Wang, Jan Kautz, and Pavlo Molchanov. Flextron: Many-in-one flexible large language model.arXiv preprint arXiv:2406.10260, 2024

  37. [37]

    Branchynet: Fast inference via early exiting from deep neural networks

    Surat Teerapittayanon, Bradley McDanel, and Hsiang-Tsung Kung. Branchynet: Fast inference via early exiting from deep neural networks. In2016 23rd International Conference on Pattern Recognition (ICPR), pages 2464–2469. IEEE, 2016

  38. [38]

    Blockdrop: Dynamic inference paths in residual networks

    Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S Davis, Kristen Grauman, and Rogerio Feris. Blockdrop: Dynamic inference paths in residual networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8817– 8826, 2018

  39. [39]

    Fastbert: a self- distilling bert with adaptive inference time

    Weijie Liu, Peng Zhou, Zhiruo Wang, Zhe Zhao, Haotang Deng, and Qi Ju. Fastbert: a self- distilling bert with adaptive inference time. InProceedings of the 58th annual meeting of the association for computational linguistics, pages 6035–6044, 2020

  40. [40]

    Adaptive Computation Time for Recurrent Neural Networks

    Alex Graves. Adaptive computation time for recurrent neural networks.arXiv preprint arXiv:1603.08983, 2016

  41. [41]

    Universal Transformers

    Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Uni- versal transformers.arXiv preprint arXiv:1807.03819, 2018

  42. [42]

    Accelerating optimization via differentiable stopping time.arXiv preprint arXiv:2505.22509, 2025

    Zhonglin Xie, Yiman Fong, Haoran Yuan, and Zaiwen Wen. Accelerating optimization via differentiable stopping time.arXiv preprint arXiv:2505.22509, 2025

  43. [43]

    Loopformer: Elastic-depth looped trans- formers for latent reasoning via shortcut modulation.arXiv preprint arXiv:2602.11451, 2026

    Ahmadreza Jeddi, Marco Ciccone, and Babak Taati. Loopformer: Elastic-depth looped trans- formers for latent reasoning via shortcut modulation.arXiv preprint arXiv:2602.11451, 2026

  44. [44]

    Token Merging: Your ViT But Faster

    Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster.arXiv preprint arXiv:2210.09461, 2022

  45. [45]

    Adaptive computation with elastic input sequence

    Fuzhao Xue, Valerii Likhosherstov, Anurag Arnab, Neil Houlsby, Mostafa Dehghani, and Yang You. Adaptive computation with elastic input sequence. InInternational Conference on Machine Learning, pages 38971–38988. PMLR, 2023

  46. [46]

    Thinkingvit: Matryoshka thinking vision transformer for elastic inference

    Ali Hojjat, Janek Haberer, Soren Pirk, and Olaf Landsiedel. Thinkingvit: Matryoshka thinking vision transformer for elastic inference. arXiv preprint arXiv:2507.10800, 2025

  47. [47]

    Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts

    Naibin Gu, Zhenyu Zhang, Yuchen Feng, Yilong Chen, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, et al. Elastic moe: Unlocking the inference-time scalability of mixture-of-experts. arXiv preprint arXiv:2509.21892, 2025

  48. [48]

    Training matryoshka mixture-of-experts for elastic inference-time expert utilization

    Yaoxiang Wang, Qingguo Hu, Yucheng Ding, Ruizhe Wang, Yeyun Gong, Jian Jiao, Yelong Shen, Peng Cheng, and Jinsong Su. Training matryoshka mixture-of-experts for elastic inference-time expert utilization. arXiv preprint arXiv:2509.26520, 2025

  49. [49]

    A-ViT: Adaptive tokens for efficient vision transformer

    Hongxu Yin, Arash Vahdat, Jose M Alvarez, Arun Mallya, Jan Kautz, and Pavlo Molchanov. A-ViT: Adaptive tokens for efficient vision transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10809–10818, 2022

  50. [50]

    Dynamicvit: Efficient vision transformers with dynamic token sparsification

    Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. Advances in neural information processing systems, 34:13937–13949, 2021

  51. [51]

    Matryoshka query transformer for large vision-language models

    Wenbo Hu, Zi-Yi Dou, Liunian H Li, Amita Kamath, Nanyun Peng, and Kai-Wei Chang. Matryoshka query transformer for large vision-language models. Advances in Neural Information Processing Systems, 37:50168–50188, 2024

  52. [52]

    Efficient attention: Attention with linear complexities

    Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, and Hongsheng Li. Efficient attention: Attention with linear complexities. arXiv preprint arXiv:1812.01243, 2018

  53. [53]

    Transformers are rnns: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020

  54. [54]

    Rethinking Attention with Performers

    Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020

  55. [55]

    The hedgehog & the porcupine: Expressive linear attentions with softmax mimicry

    Michael Zhang, Kush Bhatia, Hermann Kumbong, and Christopher Ré. The hedgehog & the porcupine: Expressive linear attentions with softmax mimicry. arXiv preprint arXiv:2402.04347, 2024

  56. [56]

    RACE Attention: A Strictly Linear-Time Attention Layer for Training on Outrageously Large Contexts

    Sahil Joshi, Agniva Chowdhury, Amar Kanakamedala, Ekam Singh, Evan Tu, and Anshumali Shrivastava. Replacing softmax similarity with a sharpened angular similarity: Theory and practice of scaling to billion-context attention. arXiv preprint arXiv:2510.04008, 2025

  57. [57]

    Zeros: Zero-sum linear attention for efficient transformers

    Jiecheng Lu, Xu Han, Yan Sun, Viresh Pati, Yubin Kim, Siddhartha Somani, and Shihao Yang. Zeros: Zero-sum linear attention for efficient transformers. arXiv preprint arXiv:2602.05230, 2026

  58. [58]

    Linformer: Self-Attention with Linear Complexity

    Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020

  59. [59]

    Nyströmformer: A nyström-based algorithm for approximating self-attention

    Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nyströmformer: A nyström-based algorithm for approximating self-attention. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 14138–14148, 2021

  60. [60]

    Reformer: The Efficient Transformer

    Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020

  61. [61]

    Big bird: Transformers for longer sequences

    Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences. Advances in neural information processing systems, 33:17283–17297, 2020

  62. [62]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020

  63. [63]

    Efficient content-based sparse attention with routing transformers

    Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9:53–68, 2021

  64. [64]

    Star attention: Efficient llm inference over long sequences

    Shantanu Acharya, Fei Jia, and Boris Ginsburg. Star attention: Efficient llm inference over long sequences. arXiv preprint arXiv:2411.17116, 2024

  65. [65]

    Synthesizer: Rethinking self-attention for transformer models

    Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. Synthesizer: Rethinking self-attention for transformer models. In International Conference on Machine Learning, pages 10183–10192. PMLR, 2021

  66. [66]

    Set transformer: A framework for attention-based permutation-invariant neural networks

    Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, pages 3744–3753. PMLR, 2019

  67. [67]

    Perceiver: General perception with iterative attention

    Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative attention. In International Conference on Machine Learning, pages 4651–4664. PMLR, 2021

  68. [68]

    Perceiver io: A general architecture for structured inputs & outputs

    Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver io: A general architecture for structured inputs & outputs. ICLR, 2022

  69. [69]

    Latte: Latent attention for linear time transformers

    Rares Dolga, Lucas Maystre, Marius Cobzarenco, and David Barber. Latte: Latent attention for linear time transformers. 2024

  70. [70]

    Luna: Linear unified nested attention

    Xuezhe Ma, Xiang Kong, Sinong Wang, Chunting Zhou, Jonathan May, Hao Ma, and Luke Zettlemoyer. Luna: Linear unified nested attention. Advances in Neural Information Processing Systems, 34:2441–2453, 2021

  71. [71]

    Attention bottlenecks for multimodal fusion

    Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, and Chen Sun. Attention bottlenecks for multimodal fusion. Advances in neural information processing systems, 34:14200–14213, 2021

  72. [72]

    Random feature attention

    Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A Smith, and Lingpeng Kong. Random feature attention. arXiv preprint arXiv:2103.02143, 2021

  73. [73]

    Going beyond linear transformers with recurrent fast weight programmers

    Kazuki Irie, Imanol Schlag, Róbert Csordás, and Jürgen Schmidhuber. Going beyond linear transformers with recurrent fast weight programmers. Advances in neural information processing systems, 34:7703–7717, 2021

  74. [74]

    On the parameterization and initialization of diagonal state space models

    Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré. On the parameterization and initialization of diagonal state space models. Advances in neural information processing systems, 35:35971–35983, 2022

  75. [75]

    Abc: Attention with bounded-memory control

    Hao Peng, Jungo Kasai, Nikolaos Pappas, Dani Yogatama, Zhaofeng Wu, Lingpeng Kong, Roy Schwartz, and Noah A Smith. Abc: Attention with bounded-memory control. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7469–7483, 2022

  76. [76]

    Fine-tuning pre-trained transformers into decaying fast weights

    Huanru Henry Mao. Fine-tuning pre-trained transformers into decaying fast weights. In Proceedings of the 2022 conference on empirical methods in natural language processing, pages 10236–10242, 2022

  77. [77]

    Gated linear attention transformers with hardware-efficient training

    Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635, 2023

  78. [78]

    Retentive Network: A Successor to Transformer for Large Language Models

    Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023

  79. [79]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

  80. [80]

    Hgrn2: Gated linear rnns with state expansion

    Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, and Yiran Zhong. Hgrn2: Gated linear rnns with state expansion. arXiv preprint arXiv:2404.07904, 2024

Showing first 80 references.