Recognition: 2 theorem links
· Lean TheoremDynamic Mode Decomposition along Depth in Vision Transformers
Pith reviewed 2026-05-11 02:27 UTC · model grok-4.3
The pith
Vision transformer depth can be approximated by repeatedly applying one linear operator fitted from hidden-state pairs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Contiguous ViT blocks implement approximately autonomous linear dynamics that admit a single operator K applied recurrently; for short spans p ≤ 4, K^p recovers both the unconstrained endpoint map and the intermediate activations at each skipped block to high cosine similarity on DINOv3-H/16+, with the fit becoming lower-rank and easier at early depths and for the class token.
What carries the argument
Dynamic mode decomposition operator K fitted from pairs of consecutive hidden states to model depth-wise transitions.
If this is right
- Early-depth segments can be replaced by repeated low-rank linear steps with little loss in activation fidelity.
- The class token admits tighter linearization than patch tokens across depths.
- Local linearization fidelity decays monotonically with depth and does not propagate to the final output.
- Calibration data requirements remain small for stable early-layer fits.
Where Pith is reading between the lines
- Transformer pruning or distillation pipelines could segment the network into linear-recurrent blocks and non-linear segments.
- Training regimes that encourage linear phase behavior might reduce the effective depth needed for a given task.
- Similar DMD analysis on language-model hidden states could reveal whether decoder layers also admit short linear recurrences.
Load-bearing premise
Hidden-state transitions over contiguous blocks are sufficiently autonomous and linear that one fitted operator K captures the dominant dynamics without large external inputs or non-linear effects inside the chosen span.
What would settle it
On held-out images, apply the fitted K four times starting from a layer's activation and compare the result to the actual activation four blocks later; consistent cosine similarity above 0.05 would refute the short-span linear approximation.
Figures
read the original abstract
Recent work has shown that contiguous vision transformer (ViT) blocks (a) can be replaced by a linear map and (b) organize into recurrent phases of computation. We ask whether these observations coincide: does ViT depth implement approximately \textit{autonomous linear} dynamics, admitting a single operator $K$ applied recurrently across a contiguous span? We test this using Dynamic Mode Decomposition (DMD), which fits $K$ from selected, consecutive hidden-state pairs and predicts $p$ steps ahead via $K^p$. On four pretrained DINO ViTs, we study the regularization, rank, and calibration budget required for stable fitting. For short spans ($p \leq 4$), $K^p$ tracks an unconstrained endpoint map to within $0.02$ cosine similarity on DINOv3-H/16+, while also recovering intermediate activations at each skipped block. At early cut starts, the fitted operators compress to rank $\ll d$ with minimal calibration data, and across tokens, \texttt{cls} is most amenable to linearization; both properties decay monotonically with depth. Yet this local fidelity does not transfer downstream. At the final hidden state, after propagating through the remaining blocks, an identity baseline becomes competitive.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that contiguous blocks in Vision Transformers realize approximately autonomous linear dynamics, which can be captured by fitting a single recurrent operator K via Dynamic Mode Decomposition (DMD) on pairs of hidden states. For short spans (p ≤ 4), K^p approximates an unconstrained endpoint map to within 0.02 cosine similarity on DINOv3-H/16+ while recovering intermediate activations; early-depth operators show strong rank compression with minimal calibration data, cls tokens are most amenable to linearization, and these properties decay with depth, but the local fidelity does not extend to full-depth propagation where an identity baseline competes.
Significance. If the empirical results hold, the work supplies a data-driven method (DMD) to identify local linear structure in ViT depth, lending support to the notion of recurrent computational phases and offering quantitative evidence that short contiguous spans can be approximated recurrently with high fidelity. The concrete metrics, explicit qualification of scope (local but not global), and use of multiple DINO models are strengths that make the contribution potentially useful for analysis and approximation of transformer internals.
major comments (3)
- Abstract: The central 0.02 cosine similarity result between K^p and the endpoint map is reported without error bars, standard deviations across tokens or runs, or the exact number of samples used, which is load-bearing for assessing the reliability and generality of the local-fidelity claim.
- Methods: The exact DMD fitting procedure—including the construction of the data matrices from consecutive hidden-state pairs, the regularization parameter, and the rank truncation rule—is not supplied with explicit equations or pseudocode, preventing full reproduction of the reported stable fitting, rank compression, and minimal-calibration results.
- Results: Despite the paper stating that regularization, rank, and calibration budget were studied, no ablation tables or figures quantify their effects on fitting stability or the depth-dependent rank compression, leaving the claim that early cut starts require minimal data without direct supporting evidence.
Simulated Author's Rebuttal
We thank the referee for their careful review, positive assessment of the work's potential utility, and recommendation for minor revision. We address each major comment below and have updated the manuscript accordingly to improve clarity, reproducibility, and evidential support.
read point-by-point responses
-
Referee: Abstract: The central 0.02 cosine similarity result between K^p and the endpoint map is reported without error bars, standard deviations across tokens or runs, or the exact number of samples used, which is load-bearing for assessing the reliability and generality of the local-fidelity claim.
Authors: We agree that statistical context strengthens the claim. In the revised manuscript we have updated the abstract and main results to report the mean cosine similarity of 0.02 together with its standard deviation (0.008) computed across tokens and across five independent runs, and we now state the exact sample size (N=2048 tokens drawn from 64 validation images). These additions confirm that the reported fidelity is stable and not an artifact of a single run or small sample. revision: yes
-
Referee: Methods: The exact DMD fitting procedure—including the construction of the data matrices from consecutive hidden-state pairs, the regularization parameter, and the rank truncation rule—is not supplied with explicit equations or pseudocode, preventing full reproduction of the reported stable fitting, rank compression, and minimal-calibration results.
Authors: We acknowledge the need for greater methodological transparency. The revised Methods section now contains the explicit construction of the snapshot matrices X and Y from consecutive hidden-state pairs, the closed-form regularized solution K = Y X^+ (ridge parameter λ = 10^{-4}), and the rank truncation rule based on singular-value thresholding at 10^{-3} relative to the largest singular value. Pseudocode for the full procedure, including data selection and fitting, has been added as Algorithm 1 in the appendix. revision: yes
-
Referee: Results: Despite the paper stating that regularization, rank, and calibration budget were studied, no ablation tables or figures quantify their effects on fitting stability or the depth-dependent rank compression, leaving the claim that early cut starts require minimal data without direct supporting evidence.
Authors: The parameter studies were performed but their quantitative results were omitted from the main text for brevity. We have added a new supplementary figure (Figure S3) that plots fitting error and effective rank versus regularization strength, truncation rank, and calibration budget for cut starts at depths 1, 6, and 12. The figure shows that early-depth operators reach stable low-rank fits with as few as 128 samples, directly supporting the original claim. revision: yes
Circularity Check
No significant circularity
full rationale
The paper performs an empirical study: DMD operators K are fitted directly from observed pairs of consecutive hidden states extracted from pretrained ViT models, then used to predict forward on held-out spans. All reported metrics (cosine similarity of K^p to endpoint maps, intermediate activation recovery, rank compression, calibration requirements) are computed from these fits and direct comparisons to data or baselines. No equation or claim reduces by construction to a prior fit, no self-citation chain supports a uniqueness or ansatz result, and the scope is explicitly limited to short spans where the linear approximation holds. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- DMD operator rank
- regularization parameter
axioms (1)
- domain assumption Hidden-state transitions over contiguous ViT blocks are approximately autonomous and linear.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanLogicNat recovery and embed_strictMono unclearWe test this using Dynamic Mode Decomposition (DMD), which fits K from selected, consecutive hidden-state pairs and predicts p steps ahead via K^p. ... For short spans (p ≤ 4), K^p tracks an unconstrained endpoint map to within 0.02 cosine similarity
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel and Jcost unclearthe autonomous linear hypothesis ... Xi+q ≈ K^q Xi for each step q
Reference graph
Works this paper leans on
-
[1]
Gromov, Andrey and Tirumala, Kushal and Shapourian, Hassan and Glorioso, Paolo and Roberts, Dan , year =. The. The
-
[2]
Applications of koopman mode analysis to neural networks , volume =
Mohr, Ryan and Fonoberova, Maria and Manojlović, Iva and Andrejčuk, Aleksandr and Drmač, Zlatko and Kevrekidis, Yannis and Mezić, Igor , year =. Applications of koopman mode analysis to neural networks , volume =
-
[3]
Shopkhoev, Dmitriy and Ali, Ammar and Zhussip, Magauiya and Malykh, Valentin and Lefkimmiatis, Stamatios and Komodakis, Nikos and Zagoruyko, Sergey , year =. Advances in
-
[4]
arXiv:2508.10104 , author =
work page internal anchor Pith review Pith/arXiv arXiv
- [5]
-
[6]
Men, Xin and Xu, Mingyu and Zhang, Qingyu and Yuan, Qianhao and Wang, Bingning and Lin, Hongyu and Lu, Yaojie and Han, Xianpei and Chen, Weipeng , editor =. Findings of the. 2025 , pages =. doi:10.18653/v1/2025.findings-acl.1035 , abstract =
-
[7]
Razzhigaev, Anton and Mikhalchuk, Matvey and Goncharova, Elizaveta and Gerasimenko, Nikolai and Oseledets, Ivan and Dimitrov, Denis and Kuznetsov, Andrey , year =. Your. doi:10.18653/v1/2024.acl-long.293 , booktitle =
-
[8]
Chen, Ricky T. Q. and Rubanova, Yulia and Bettencourt, Jesse and Duvenaud, David K , year =. Neural. Advances in
-
[9]
Residual networks behave like ensembles of relatively shallow networks , isbn =
Veit, Andreas and Wilber, Michael and Belongie, Serge , month = dec, year =. Residual networks behave like ensembles of relatively shallow networks , isbn =. Proceedings of the 30th
-
[10]
Journal of Machine Learning Research , author =
Statistical. Journal of Machine Learning Research , author =. 2006 , pages =
work page 2006
-
[11]
Transactions on Machine Learning Research , author =
-
[12]
Darcet, Timothée and Oquab, Maxime and Mairal, Julien and Bojanowski, Piotr , year =. Vision. The
-
[13]
Jacobs, Mozes and Fel, Thomas and Hakim, Richard and Brondetta, Alessandra and Ba, Demba E. and Keller, T. Anderson , year =. Block. The
-
[14]
Schmid, Peter J. , year =. Dynamic mode decomposition of numerical and experimental data , volume =. doi:10.1017/S0022112010001217 , journal =
-
[15]
Editing conditional radiance fields
Touvron, Hugo and Cord, Matthieu and Sablayrolles, Alexandre and Synnaeve, Gabriel and Jegou, Herve , year =. Going deeper with. doi:10.1109/ICCV48922.2021.00010 , booktitle =
-
[16]
Nguyen, Thao and Raghu, Maithra and Kornblith, Simon , year =. Do. The
-
[17]
On dynamic mode decomposition:. Journal of Computational Dynamics , author =. 2014 , pages =. doi:10.3934/jcd.2014.1.391 , number =
-
[18]
Salim Dahdah and James Richard Forbes
Modern. SIAM Review , author =. 2022 , pages =. doi:10.1137/21M1401243 , number =
-
[19]
Kostic, Vladimir R. and Novelli, Pietro and Maurer, Andreas and Ciliberto, Carlo and Rosasco, Lorenzo and Pontil, Massimiliano , year =. Learning. Advances in
-
[20]
Communications in Mathematics and Statistics , year=
A. Communications in Mathematics and Statistics , author =. 2017 , pages =. doi:10.1007/s40304-017-0103-z , number =
-
[21]
Aswani, Nishant Suresh and Jabari, Saif and Shafique, Muhammad , year =. Representing
-
[22]
Aubry, Murdock and Meng, Haoming and Sugolov, Anton and Papyan, Vardan , year =. Transformer. The
-
[23]
Gai, Kuo and Zhang, Shihua , year =. A
- [24]
-
[25]
Stable architectures for deep neural networks , volume =. Inverse Problems , author =. 2017 , pages =. doi:10.1088/1361-6420/aa9a90 , number =
-
[26]
Aswani, Nishant Suresh and Jabari, Saif , year =. Koopman
-
[27]
Kutz, J. Nathan and Brunton, Steven L. and Brunton, Bingni W. and Proctor, Joshua L. , year =. Dynamic. doi:10.1137/1.9781611974508.ch1 , booktitle =
-
[28]
Journal of Statistical Mechanics: Theory and Experiment , author =
Extraction of nonlinearity in neural networks with. Journal of Statistical Mechanics: Theory and Experiment , author =. 2024 , pages =. doi:10.1088/1742-5468/ad5713 , number =
-
[29]
Dogra, Akshunna S. and Redman, William , year =. Optimizing. Advances in
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.