pith. machine review for the scientific record.

arxiv: 2605.10661 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.AI

Recognition: no theorem link

bViT: Investigating Single-Block Recurrence in Vision Transformers for Image Recognition

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:56 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords vision transformers · recurrent models · image classification · parameter efficiency · ImageNet · transfer learning · implicit depth

The pith

A single shared transformer block reused recurrently can match the accuracy of a full-depth Vision Transformer while using an order of magnitude fewer parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how much of the performance in stacked Vision Transformers comes from depth via unique layer parameters versus iterative computation on evolving representations. It introduces bViT, which applies one transformer block repeatedly across multiple steps to an image patch sequence, keeping the iterative refinement structure but eliminating per-layer parameterization. Experiments show that a 12-step bViT-B reaches ImageNet-1K top-1 accuracy comparable to a standard 12-layer ViT-B when trained identically and under the same compute budget, yet requires roughly ten times fewer parameters. Wider hidden dimensions allow the recurrent model to recover more of the baseline performance, which the authors attribute to the shared block realizing step-dependent computations through changes in the hidden state.
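The single-block recurrence can be sketched in a few lines of PyTorch. The class below is an illustrative reconstruction, not the authors' code: the block internals use a standard pre-norm encoder layer, and the defaults only mirror the ViT-B-style configuration named above (dim 768, 12 heads, 12 steps).

```python
import torch
import torch.nn as nn

class RecurrentViT(nn.Module):
    """Sketch of single-block recurrence: one transformer block applied
    T times to the patch-token sequence, instead of T distinct blocks."""

    def __init__(self, dim=768, heads=12, steps=12, num_classes=1000):
        super().__init__()
        self.steps = steps
        # One shared pre-norm block; a stacked ViT would hold `steps` of these.
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):           # tokens: (batch, seq, dim)
        h = tokens
        for _ in range(self.steps):      # identical weights at every step
            h = self.block(h)
        return self.head(h[:, 0])        # classify from the first (CLS) token
```

Patch embedding and positional encodings are omitted; the point is only that depth becomes a loop over one parameter set.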

Core claim

bViT processes images by repeatedly applying the identical transformer block, preserving the multi-step iterative structure of deep ViTs without dedicating separate parameters to each layer. On ImageNet-1K the 12-step bViT-B attains accuracy comparable to standard ViT-B under matched training recipe and computational budget while using an order of magnitude fewer parameters. Recurrent accuracy rises with representation width, interpreted as implicit depth multiplexing in which the shared block expresses different effective transformations as the hidden state evolves. Mechanistic probes of activations, attention maps, and step-wise pruning confirm that the block alters its behavior across the recurrent steps rather than repeating a single fixed computation.

What carries the argument

Single-block recurrence in which one transformer block is applied repeatedly to an evolving hidden state, allowing step-dependent computations without layer-specific weights.

If this is right

  • Wider representation dimensions let recurrent models recover a larger fraction of standard ViT performance.
  • bViT achieves competitive transfer accuracy on downstream image tasks.
  • The architecture supports parameter-efficient fine-tuning by updating only the shared block.
  • Analyses of activations and attention show the shared block changes its effective computation across recurrent steps rather than repeating identical operations.
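The parameter-efficient fine-tuning point can be illustrated with a hedged sketch: freeze everything, then unfreeze only the shared block, so the trainable-weight count is one block's worth regardless of effective depth. The attribute name `model.block` is hypothetical, not taken from the paper's code.

```python
import torch.nn as nn

def mark_block_only_trainable(model: nn.Module) -> int:
    """Freeze all parameters except those of the shared block; return
    the number of trainable parameters that remain."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.block.parameters():   # hypothetical attribute name
        p.requires_grad = True
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```

In a stacked ViT the analogous move would still leave twelve blocks' worth of weights trainable; here the shared block is the whole transformer.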

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Recurrent reuse could lower peak memory during training by keeping only one block in GPU memory at a time.
  • The width-recurrence tradeoff may extend to other transformer-based sequence models beyond vision.
  • Step-dependent pruning results suggest that future work could learn or schedule different effective depths per image or task.
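The first bullet's memory intuition can be sketched with activation recomputation: because every step reuses the same weights, checkpointing each step keeps only one step's activations (and one block's weights) live at a time. The module below is a stand-in, not the paper's implementation, and whether bViT actually trains this way is an assumption of this sketch.

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Linear(8, 8)   # stand-in for the shared transformer block

def unroll(h, steps):
    for _ in range(steps):
        # Recompute this step's activations during the backward pass
        # instead of storing them, trading compute for memory.
        h = checkpoint(block, h, use_reentrant=False)
    return h

h = torch.randn(2, 8, requires_grad=True)
loss = unroll(h, 12).sum()
loss.backward()   # gradients flow through all 12 recomputed steps
```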

Load-bearing premise

The training recipe and total computational budget are truly equivalent between the recurrent bViT and the standard stacked ViT, so that any accuracy match arises from the recurrence itself rather than uncontrolled differences in optimization or implementation.
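The parameter side of this premise is easy to sanity-check arithmetically. The sketch below counts only the attention and FFN projection weights of one ViT-B-style block (dim 768, MLP ratio 4); embeddings, norms, and the classifier head are ignored, so the exact totals will differ somewhat from the paper's parameter counts.

```python
# Back-of-envelope ratio implied by the premise: block parameters shrink
# by the depth factor, while per-image compute (12 block evaluations)
# is identical for the stacked and recurrent models.

def block_params(dim, mlp_ratio=4):
    attn = 4 * dim * dim             # Q, K, V and output projections
    mlp = 2 * mlp_ratio * dim * dim  # FFN in- and out-projections
    return attn + mlp

dim, depth = 768, 12
stacked = depth * block_params(dim)  # ViT-B: 12 distinct blocks
shared = block_params(dim)           # bViT-B: one block reused 12 times

print(stacked / shared)              # parameter ratio equals the depth
```

The "order of magnitude" framing in the claim matches this depth factor of 12, softened by the shared embeddings and head that both models still carry.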

What would settle it

Train a non-recurrent single-block ViT (one block applied once) with the same total parameter count and compute as the 12-step bViT-B and measure whether its ImageNet accuracy falls substantially below the recurrent version.
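Structurally, the proposed control just varies the number of applications of one shared map. As a toy stand-in (an affine contraction rather than a trained transformer block, so purely illustrative), the harness would look like:

```python
def run_recurrent(block, h, steps):
    """Apply one shared map `steps` times to an evolving state."""
    for _ in range(steps):
        h = block(h)
    return h

# One application vs. a 12-step unroll of the same map: with a toy
# contractive update, extra steps keep refining the state toward a
# fixed point, mimicking iterative representation refinement.
out1 = run_recurrent(lambda h: 0.5 * h + 1.0, 4.0, 1)
out12 = run_recurrent(lambda h: 0.5 * h + 1.0, 4.0, 12)
```

The experiment would compare ImageNet accuracy of the 1-step and 12-step configurations trained under the same recipe.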

Figures

Figures reproduced from arXiv: 2605.10661 by Alberto Presta, Grzegorz Gruszczynski, Grzegorz Stefanski, Michal Byra, Pawel Olszowiec.

Figure 1. We investigate a single-block vision transformer, bViT, in which the same transformer block is applied recurrently across depth. Reusing a single shared block improves parameter efficiency and enables mechanistic analysis by, for instance, allowing the same attention heads and FFN neurons to be tracked across recurrent steps.
Figure 2. Increasing recurrence beyond 12 steps does not improve bViT-S ImageNet-1K accuracy, while larger embeddings improve performance.
Figure 3. Optimization-based FFN neuron visualizations across bViT-B recurrence steps. Early steps show simple periodic textures, while later steps reveal more structured patterns, indicating step-dependent neuron behavior.
Figure 4. Evaluation of attention maps across recurrent steps on ImageNet-S. We report the mean …
Figure 5. Pruning reveals step-specific weight usage in bViT. (a) Active weights are categorized into …
Figure 6. 95% energy rank of the FFN in-projection, FFN out-projection and CLS tokens across …
Figure 7. ImageNet-1K validation accuracy as a function of truncated-SVD rank applied to the FFN …
Figure 8. ImageNet-1K validation accuracy of bViT-B across recurrent inference steps. Accuracy peaks near the 12-step training horizon and decreases when the shared block is applied beyond the number of steps used during training.
Figure 9. PaCMAP visualization of latent trajectories across recurrent steps for bViT-B trained on …
Figure 10. Pointing-game hit ratio for individual attention heads across recurrent steps in bViT-B.
Figure 11. Pointing-game hit ratio for individual attention heads across recurrent steps in bViT-B-TE.
Figure 12. Pointing-game hit ratio for individual attention heads across recurrent steps in bViT …
Figure 13. Qualitative comparison of attention maps across recurrent steps for the heads with the …
Figure 14. Pruning dynamics across recurrent steps. (A-D) Specialization per step (1-12) as a function …
read the original abstract

Vision Transformers (ViTs) are built by stacking independently parameterized blocks, but it remains unclear how much of this depth requires layer specific transformations and how much can be realized through recurrent computation. We study this question with bViT, a single-block recurrent ViT in which one transformer block is applied repeatedly to process an image. This architecture preserves the iterative structure of a deep ViT while removing layer specific block parameterization, providing a controlled setting for studying recurrence in vision. On ImageNet-1K, a 12-step bViT-B achieves accuracy comparable to standard ViT-B under the same training recipe and computational budget, while using an order of magnitude fewer parameters. We observe that recurrent performance improves with representation width, with wider bViTs recovering much more of the performance of standard ViTs than narrow variants. We interpret this behavior as implicit depth multiplexing, where a shared block expresses multiple step-dependent computations through the evolving hidden state. Beyond ImageNet classification, bViT transfers competitively to downstream tasks and enables parameter-efficient fine-tuning. Mechanistic analyses of activations, attention and step-specific pruning show that the shared block changes its effective behavior across recurrent steps rather than simply repeating the same computation. Our results suggest that a large fraction of ViT depth can be implemented through recurrent reuse, provided that the representation space is sufficiently wide.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces bViT, a Vision Transformer that replaces the standard stack of independently parameterized blocks with a single shared transformer block applied recurrently for a fixed number of steps. The central empirical claim is that a 12-step bViT-B achieves ImageNet-1K accuracy comparable to a standard ViT-B under the same training recipe and computational budget while using an order of magnitude fewer parameters. The authors support this with mechanistic analyses of activations, attention patterns, and step-specific pruning that indicate the shared block changes its effective computation across steps rather than repeating identical operations. They further report that wider representations recover more of the standard ViT performance, interpret this as implicit depth multiplexing, and show competitive transfer to downstream tasks plus benefits for parameter-efficient fine-tuning.

Significance. If the matched-budget and matched-recipe claims are substantiated, the result would be significant for understanding the role of depth versus recurrence in Vision Transformers and for designing parameter-efficient vision models. The controlled single-block setup isolates recurrence effects more cleanly than prior recurrent transformer variants, and the width-scaling observation plus mechanistic analyses provide concrete evidence that a shared block can express step-dependent transformations through evolving hidden states. The parameter reduction and downstream transfer results are concrete strengths that could influence efficient architecture design.

major comments (3)
  1. [Section 4] Section 4 (Experiments) and Table 1: The claim that computational budgets are matched between 12-step bViT-B and standard ViT-B is load-bearing for attributing accuracy parity to recurrence, yet the manuscript provides no per-step FLOPs breakdown, total training FLOPs, peak memory, or wall-clock time comparison. Recurrent unrolling can alter caching, mixed-precision behavior, and gradient flow relative to independent blocks, so explicit verification is required to rule out incidental optimization differences.
  2. [Section 4.2] Section 4.2 (Ablations and controls): No ablation is presented against a non-recurrent weight-tied baseline (e.g., single application of the shared block or fixed hidden-state reuse) or against an independently parameterized model constrained to the same total parameter count. Without these controls it is difficult to isolate the performance contribution of recurrence itself from weight sharing or other architectural choices.
  3. [Results] Results section and Table 1: Exact top-1 accuracies, standard deviations across multiple random seeds, and the precise ViT-B baseline accuracy under the identical recipe are not reported; only the qualitative statement “comparable” appears. This prevents quantitative assessment of whether the observed parity is robust or within the range of training noise.
minor comments (2)
  1. [Figure 3] Figure 3 (attention visualizations): Step indices and any quantitative measures of attention change across steps should be annotated directly on the figure panels for immediate readability.
  2. [Section 3] Section 3 (Model definition): The recurrence update rule and hidden-state notation would benefit from an explicit equation (e.g., h_{t+1} = Block(h_t, x)) to make the unrolling mechanics unambiguous.
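Written out, the update rule the second minor comment requests might read as follows (an illustrative sketch in the referee's notation; whether the patch embedding is re-injected at every step is the paper's design choice and is not established here):

```latex
h_0 = \mathrm{Embed}(x), \qquad h_{t+1} = \mathrm{Block}(h_t), \qquad t = 0, \dots, T-1,
```

with the classification head applied to $h_T$ after $T$ recurrent steps.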

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects for strengthening the empirical claims and controls in our work on bViT. We address each major comment below, committing to revisions that enhance clarity and rigor while preserving the core contributions on recurrence in Vision Transformers.

read point-by-point responses
  1. Referee: [Section 4] Section 4 (Experiments) and Table 1: The claim that computational budgets are matched between 12-step bViT-B and standard ViT-B is load-bearing for attributing accuracy parity to recurrence, yet the manuscript provides no per-step FLOPs breakdown, total training FLOPs, peak memory, or wall-clock time comparison. Recurrent unrolling can alter caching, mixed-precision behavior, and gradient flow relative to independent blocks, so explicit verification is required to rule out incidental optimization differences.

    Authors: We agree that detailed verification of matched budgets is essential to isolate the effects of recurrence. Although the training recipe was designed to equate the number of block applications and overall compute (with bViT using the same operations per step as a standard block), we did not provide an explicit breakdown. In the revised manuscript, we will add a per-step FLOPs analysis, total training FLOPs, peak memory usage during training and inference, and wall-clock time comparisons on identical hardware to confirm equivalence and address potential differences in caching or gradient flow. revision: yes

  2. Referee: [Section 4.2] Section 4.2 (Ablations and controls): No ablation is presented against a non-recurrent weight-tied baseline (e.g., single application of the shared block or fixed hidden-state reuse) or against an independently parameterized model constrained to the same total parameter count. Without these controls it is difficult to isolate the performance contribution of recurrence itself from weight sharing or other architectural choices.

    Authors: This is a fair point for isolating recurrence effects. We will add a 1-step bViT baseline (single application of the shared block) to Table 1 and the ablations section to quantify the benefit of multiple recurrent steps. For fixed hidden-state reuse, we can include a control where the state is not updated after the first step. Regarding an independently parameterized model with the same total parameter count, this equates to a 1-block standard ViT, which we will explicitly compare as a parameter-matched shallow baseline. Our existing mechanistic analyses (evolving activations, attention patterns, and step-specific pruning) already demonstrate that the shared block performs distinct computations across steps rather than repeating identical operations, providing evidence beyond weight sharing alone; we will expand the discussion to tie these directly to the new controls. revision: partial

  3. Referee: [Results] Results section and Table 1: Exact top-1 accuracies, standard deviations across multiple random seeds, and the precise ViT-B baseline accuracy under the identical recipe are not reported; only the qualitative statement “comparable” appears. This prevents quantitative assessment of whether the observed parity is robust or within the range of training noise.

    Authors: We acknowledge that explicit numerical reporting is necessary for assessing robustness. Although Table 1 in the manuscript contains the accuracy values, the main text uses the term “comparable” for brevity. In the revision, we will explicitly report the exact top-1 accuracies for the 12-step bViT-B and the standard ViT-B baseline under the identical recipe, along with standard deviations computed across at least three random seeds to demonstrate that the parity holds within training variability. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical architecture comparison with no derivations or self-referential predictions

full rationale

The paper reports experimental results on ImageNet-1K accuracy, transfer learning, and mechanistic analyses of activations/attention for a recurrent single-block ViT versus a standard stacked ViT. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claim (accuracy parity under matched budget) is an empirical observation, not a reduction to inputs by construction. Self-citations, if present in the full manuscript, are not invoked to justify uniqueness theorems or ansatzes that would force the result. This is the expected non-finding for an architecture-ablation study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard empirical machine-learning assumptions about training convergence and evaluation protocols on ImageNet-1K. No free parameters are explicitly introduced beyond those inherent to transformer training, and no new entities are postulated.

axioms (1)
  • domain assumption Standard deep learning assumptions on optimization, data augmentation, and ImageNet evaluation protocols hold for both bViT and baseline ViT models.
    The paper compares models under the same training recipe, which implicitly relies on these background conventions.

pith-pipeline@v0.9.0 · 5555 in / 1297 out tokens · 46427 ms · 2026-05-12T04:56:55.411802+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 3 internal anchors

  1. [1]

    Evolutionary optimization of model merging recipes.Nature Machine Intelligence, 7(2):195–204, 2025

    Takuya Akiba, Makoto Shing, Yujin Tang, Qi Sun, and David Ha. Evolutionary optimization of model merging recipes.Nature Machine Intelligence, 7(2):195–204, 2025

  2. [2]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021

  3. [3]

    Describing textures in the wild

    M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014

  4. [4]

    Vision transformers need registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. InInternational Conference on Learning Representations (ICLR), 2024

  5. [5]

    Universal transformers.Proceedings of the International Conference on Learning Representations (ICLR), 2019

    Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers.Proceedings of the International Conference on Learning Representations (ICLR), 2019

  6. [6]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

  7. [7]

    An image is worth 16x16 words: Transformers for image recognition at scale.Proceedings of the International Conference on Learning Representations (ICLR), 2020

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.Proceedings of the International Conference on Learning Representations (ICLR), 2020

  8. [8]

    Large- scale unsupervised semantic segmentation.IEEE transactions on pattern analysis and machine intelligence, 45(6):7457–7476, 2022

    Shanghua Gao, Zhong-Yu Li, Ming-Hsuan Yang, Ming-Ming Cheng, Junwei Han, and Philip Torr. Large- scale unsupervised semantic segmentation.IEEE transactions on pattern analysis and machine intelligence, 45(6):7457–7476, 2022

  9. [9]

    Revealing the utilized rank of subspaces of learning in neural networks

    Isha Garg, Christian Koguchi, Eshan Verma, and Daniel Ulbricht. Revealing the utilized rank of subspaces of learning in neural networks. InProceedings of the AAAI Symposium Series, volume 5, pages 151–158, 2025

  10. [10]

    What do vision transformers learn? a visual exploration.ArXiv e-print, 2022

    Amin Ghiasi, Hamid Kazemi, Eitan Borgnia, Steven Reich, Manli Shu, Micah Goldblum, Andrew Gordon Wilson, and Tom Goldstein. What do vision transformers learn? a visual exploration.ArXiv e-print, 2022

  11. [11]

    Looped transformers as programmable computers

    Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D Lee, and Dimitris Papail- iopoulos. Looped transformers as programmable computers. InInternational Conference on Machine Learning, pages 11398–11442. PMLR, 2023

  12. [12]

    A token is worth over 1,000 tokens: Efficient knowledge distillation through low-rank clone.Advances in Neural Information Processing Systems (NeurIPS), 2025

    Jitai Hao, Qiang Huang, Hao Liu, Xinyan Xiao, Zhaochun Ren, and Jun Yu. A token is worth over 1,000 tokens: Efficient knowledge distillation through low-rank clone.Advances in Neural Information Processing Systems (NeurIPS), 2025

  13. [13]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022

  14. [14]

    Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

  15. [15]

    Block-recurrent dynamics in vision transformers.arXiv preprint arXiv:2512.19941, 2025

    Mozes Jacobs, Thomas Fel, Richard Hakim, Alessandra Brondetta, Demba Ba, and T Andy Keller. Block-recurrent dynamics in vision transformers.arXiv preprint arXiv:2512.19941, 2025

  16. [16]

    Less is More: Recursive Reasoning with Tiny Networks

    Alexia Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks.arXiv preprint arXiv:2510.04871, 2025

  17. [17]

    3d object representations for fine-grained categorization

    Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. InProceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013

  18. [18]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009

  19. [19]

    Albert: A lite bert for self-supervised learning of language representations.Proceedings of the International Conference on Learning Representations (ICLR), 2020

    Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. Proceedings of the International Conference on Learning Representations (ICLR), 2020

  20. [20]

    Uni-lora: One vector is all you need

    Kaiyang Li, Shaobo Han, Qing Su, Wei Li, Zhipeng Cai, and Shihao Ji. Uni-lora: One vector is all you need. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2017

  21. [21]

    Simple recursive model: Simplified, single-state reasoning with skip connections

    Qianli Liao and Tomaso Poggio. Simple recursive model: Simplified, single-state reasoning with skip connections. 2026

  22. [22]

    Decoupled weight decay regularization.Proceedings of the International Conference on Learning Representations (ICLR), 2019

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.Proceedings of the International Conference on Learning Representations (ICLR), 2019

  23. [23]

    Recurrent vision transformer for solving visual reasoning problems

    Nicola Messina, Giuseppe Amato, Fabio Carrara, Claudio Gennaro, and Fabrizio Falchi. Recurrent vision transformer for solving visual reasoning problems. InInternational Conference on Image Analysis and Processing, pages 50–61. Springer, 2022

  24. [24]

    Automated flower classification over a large number of classes

    Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722–729. IEEE, 2008

  25. [25]

    Parameter reduction improves vision transformers: A comparative study of sharing and width reduction.arXiv e-prints, pages arXiv–2512, 2025

    Anantha Padmanaban Krishna Kumar. Parameter reduction improves vision transformers: A comparative study of sharing and width reduction.arXiv e-prints, pages arXiv–2512, 2025

  26. [26]

    Cats and dogs

    Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE, 2012

  27. [27]

    Pytorch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library

  28. [28]

    Learning transferable visual models from natural language supervision.ArXiv e-print, 2021

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision.ArXiv e-print, 2021

  29. [29]

    Tied-lora: Enhancing parameter efficiency of lora with weight tying

    Adithya Renduchintala, Tugrul Konuk, and Oleksii Kuchaiev. Tied-lora: Enhancing parameter efficiency of lora with weight tying. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8694–8705, 2024

  30. [30]

    Reasoning with latent thoughts: On the power of looped transformers.Proceedings of the International Conference on Learning Representations (ICLR), 2025

    Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J Reddi. Reasoning with latent thoughts: On the power of looped transformers.Proceedings of the International Conference on Learning Representations (ICLR), 2025

  31. [31]

    Can you learn an algorithm? generalizing from easy to hard problems with recurrent networks

    Avi Schwarzschild, Eitan Borgnia, Arjun Gupta, Furong Huang, Uzi Vishkin, Micah Goldblum, and Tom Goldstein. Can you learn an algorithm? generalizing from easy to hard problems with recurrent networks. Advances in Neural Information Processing Systems, 34:6695–6706, 2021

  32. [32]

    Loopvit: Scaling visual arc with looped transformers.iclr, 2026

    Wen-Jie Shu, Xuerui Qiu, Rui-Jie Zhu, Harold Haodong Chen, Yexin Liu, and Harry Yang. Loopvit: Scaling visual arc with looped transformers.iclr, 2026

  33. [33]

    Routing the lottery: Adaptive subnetworks for heterogeneous data.arXiv preprint arXiv:2601.22141, 2026

    Grzegorz Stefanski, Alberto Presta, and Michal Byra. Routing the lottery: Adaptive subnetworks for heterogeneous data.arXiv preprint arXiv:2601.22141, 2026

  34. [34]

    Transformer layers as painters

    Qi Sun, Marc Pickett, Aakash Kumar Nain, and Llion Jones. Transformer layers as painters. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 25219–25227, 2025

  35. [35]

    Generalized linear mode connectivity for transformers

    Alexander Theus, Alessandro Cabodi, Sotiris Anagnostidis, Antonio Orvieto, Sidak Pal Singh, and Valentina Boeva. Generalized linear mode connectivity for transformers. 2025

  36. [36]

    Training data-efficient image transformers & distillation through attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. InInternational conference on machine learning, pages 10347–10357. PMLR, 2021

  37. [37]

    Attention is all you need.Advances in Neural Information Processing Systems (NeurIPS), 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in Neural Information Processing Systems (NeurIPS), 2017

  38. [38]

    Hierarchical Reasoning Model

    Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and Yasin Abbasi Yadkori. Hierarchical reasoning model. arXiv preprint arXiv:2506.21734, 2025

  39. [39]

    Understanding how dimension reduction tools work: An empirical approach to deciphering t-SNE, UMAP, TriMap, and PaCMAP

    Yingfan Wang, Haiyang Huang, Cynthia Rudin, and Yaron Shaposhnik. Understanding how dimension reduction tools work: An empirical approach to deciphering t-sne, umap, trimap, and pacmap for data visualization. Journal of Machine Learning Research, 22(201):1–73, 2021

  40. [40]

    Zero time waste: Recycling predictions in early exit neural networks.Advances in Neural Information Processing Systems, 34:2516–2528, 2021

    Maciej Wołczyk, Bartosz Wójcik, Klaudia Bałazy, Igor T Podolak, Jacek Tabor, Marek´Smieja, and Tomasz Trzcinski. Zero time waste: Recycling predictions in early exit neural networks.Advances in Neural Information Processing Systems, 34:2516–2528, 2021

  41. [41]

    Multimodal learning with transformers: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(10):12113–12132, 2023

    Peng Xu, Xiatian Zhu, and David A Clifton. Multimodal learning with transformers: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(10):12113–12132, 2023

  42. [42]

    Looped transformers are better at learning learning algorithms.Proceedings of the International Conference on Learning Representations (ICLR), 2024

    Liu Yang, Kangwook Lee, Robert Nowak, and Dimitris Papailiopoulos. Looped transformers are better at learning learning algorithms.Proceedings of the International Conference on Learning Representations (ICLR), 2024

  43. [43]

    Hyperloop Transformers

    Abbas Zeitoun, Lucas Torroba-Hennigen, and Yoon Kim. Hyperloop transformers.arXiv preprint arXiv:2604.21254, 2026

  44. [44]

    Root mean square layer normalization

    Biao Zhang and Rico Sennrich. Root mean square layer normalization. InAdvances in Neural Information Processing Systems, volume 32, 2019

  45. [45]

    Top-down neural attention by excitation backprop

    Jianming Zhang, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop. InEuropean Conference on Computer Vision, pages 543–559. Springer, 2016

  46. [46]

    Minivit: Compressing vision transformers with weight multiplexing

    Jinnian Zhang, Houwen Peng, Kan Wu, Mengchen Liu, Bin Xiao, Jianlong Fu, and Lu Yuan. Minivit: Compressing vision transformers with weight multiplexing. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12145–12154, 2022

  47. [47]

    Godec: Randomized low-rank & sparse matrix decomposition in noisy case

    Tianyi Zhou and Dacheng Tao. Godec: Randomized low-rank & sparse matrix decomposition in noisy case. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, 2011

    Rewind weights toθ init while preserving updated masks. In our experiments, we used P= 30 pruning steps, E= 200 full model training epochs, G= 100 graft training epochs, andr % = 20%pruning threshold. G.4 Analysis metrics To analyze the learned pathways, we compute overlap and specialization metrics based on extracted binary masks. Jaccard similarity.We m...