pith. sign in

arxiv: 2607.01987 · v1 · pith:DUXV6TP5new · submitted 2026-07-02 · 💻 cs.CV

Understanding Geometric Representations in Self-Supervised Vision Transformers via Subspace Intervention

Pith reviewed 2026-07-03 15:33 UTC · model grok-4.3

classification 💻 cs.CV
keywords Vision Transformersself-supervised learninggeometric representationslinear probingsingular value decompositionsubspace analysisDINOv2MAE
0
0 comments X

The pith

Decomposing linear probe weights via SVD isolates low-rank subspaces carrying explicit geometric signals in self-supervised ViTs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a subspace intervention framework that applies singular value decomposition to the weights of converged linear probes on geometric tasks. This decomposition separates the low-rank components that encode spatial information from the remaining feature space in Vision Transformers trained with self-supervision. The resulting view shows that pre-training objectives shape how features align or disperse, that geometric signals occupy highly compressible subspaces, and that geometric precision is strongest in intermediate layers before later layers shift toward semantic abstraction.

Core claim

By decomposing the weights of converged linear probes via SVD, the framework isolates low-rank subspaces that contain explicit geometric signals. This yields three observations on the paper's terms: pre-training objectives determine encoding, with DINOv2 aligning spatial features for efficient linear extraction while MAE disperses signals and requires broader context; explicit geometric representations are highly compressible; and layer-wise task affinity shows geometric precision peaking at intermediate layers before yielding to semantic abstraction in final layers.

What carries the argument

Subspace intervention framework that decomposes converged linear probe weights using Singular Value Decomposition to isolate low-rank subspaces containing explicit geometric signals.

If this is right

  • DINOv2 features support more efficient linear extraction of spatial information than MAE features.
  • Geometric representations remain effective when restricted to low-rank subspaces, allowing dense heads to use fewer parameters.
  • Geometric task performance peaks in intermediate ViT layers rather than final layers.
  • Feature selection for geometry-related tasks can target the identified subspaces without large performance loss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Lightweight decoder designs for dense prediction could constrain weights to the low-rank geometric subspaces identified by the method.
  • Downstream applications might select intermediate-layer features when geometry matters more than semantics.
  • The intervention technique could be applied to probe other non-semantic properties encoded in the same representations.

Load-bearing premise

The decomposition of converged linear probe weights via SVD isolates the low-rank subspaces that contain explicit geometric signals in the ViT representations.

What would settle it

A controlled experiment that measures geometric task accuracy after zeroing out only the identified low-rank SVD components versus zeroing out random components of equal rank would falsify the claim if the two interventions produce equivalent drops.

Figures

Figures reproduced from arXiv: 2607.01987 by Chao Zhang, Chunzhi Gu, Haoran Xie, Ran Dong, Weichen Zhou, Yawen Zou.

Figure 1
Figure 1. Figure 1: Overview of the controlled subspace intervention analysis framework. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: (a) Absolute Recovery: Solid colored lines depict the performance of the task-aligned subspace (VkV⊤ k ). The gray shaded area denotes the noise floor (SA-δ1 < 0.18), accounting for both the residual subspace and the random orthog￾onal baselines under the frozen probe. This gap indicates substantial representational redundancy, suggesting that explicitly decodable geometric information can be com￾pressed i… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative Visualization of Subspace Interventions. [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Energy Distribution Across Singular Directions. [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Layer-wise Subspace Analysis. Performance evolution of geometric repre￾sentations across different layers and subspace ranks. The Entanglement Gap highlights the disparity between linear and nonlinear decoding, indicating the degree of local non￾linear folding. The rank dispersion gap highlights the performance drop under extreme low-rank constraints (k = 8). Notice the distinct rank sensitivity transition… view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative Task Affinity in DINOv2. Visual predictions decoded from frozen features at layers 18, 21, and 24. Performance for (a)surface normals, (b)depth, and (c)semantic segmentation peaks at these layers, respectively. This variation demon￾strates distinct layer-wise affinities determined by the downstream task objectives. iBOT exhibits some early-layer alignment, it does not achieve the low-rank com￾p… view at source ↗
Figure 7
Figure 7. Figure 7: Normalized probing perfor￾mance. Geometric tasks peak and de￾cline early, whereas high-level semantic abstraction peaks latest. As illustrated in [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Subspace Stability. Lines and color shaded regions denote the mean and standard deviation of subspace similarity across three random seeds over ranks k. High consistency at low ranks (k ≤ 16) rules out optimization artifacts. To rule out optimization artifacts, we evaluate the subspace similarity of the extracted bases (Vk) across three ran￾dom probe initializations [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
read the original abstract

We introduce a controlled subspace intervention framework to investigate how self-supervised Vision Transformers (ViTs) encode dense geometric information. While linear probing is widely used to assess geometric representations, it treats features as a black box, failing to disentangle the underlying topology. To address this issue, we decompose the weights of converged linear probes to isolate the low-rank subspaces containing explicit geometric signals using Singular Value Decomposition (SVD). Our perspective yields three key insights: (1) Pre-training objectives determine how features are encoded. DINOv2 aligns spatial features for efficient linear extraction, while Masked Autoencoders (MAE) tend to disperse these signals, requiring a broader spatial context. (2) Explicit geometric representations are highly compressible, suggesting dense predictive heads could potentially be constrained to low-rank subspaces with minimal performance loss. (3) The layer-wise task affinity suggests that geometric precision peaks at intermediate layers before yielding to semantic abstraction in the final layers. By connecting internal encoding mechanics with downstream performance, these findings provide a basis for effective feature selection and lightweight decoder design. The source code is available at https://github.com/Zhou-Weichen/Geosubprobe.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a subspace intervention framework to analyze how self-supervised ViTs encode dense geometric information. It decomposes converged linear probe weights via SVD to isolate low-rank subspaces containing explicit geometric signals, yielding three insights: (1) pre-training objectives determine encoding (DINOv2 aligns spatial features efficiently while MAE disperses them), (2) explicit geometric representations are highly compressible, and (3) geometric precision peaks at intermediate layers before semantic abstraction dominates.

Significance. If the SVD-based identification of geometric subspaces is validated, the framework could link internal encoding mechanics to downstream performance and support efficient feature selection or low-rank decoder designs. The public code release aids reproducibility, though the absence of quantitative results in the abstract limits immediate assessment of impact.

major comments (2)
  1. [Method / Abstract] The core identification assumption—that SVD on converged linear probe weights isolates low-rank subspaces specifically encoding explicit geometric signals rather than any sufficient statistic for the probe task—is untested and load-bearing for all three insights. Linear probes can exploit entangled or spurious correlations, and SVD yields an orthogonal basis ordered by probe variance without reference to original feature geometry or control tasks (see skeptic analysis and abstract description of the framework).
  2. [Abstract] The abstract outlines the method and states three insights but includes no numerical results, validation experiments, or details on derivation; the full manuscript must supply quantitative evidence (e.g., performance deltas under subspace intervention, compressibility curves, layer-wise metrics) to support claims about alignment, compressibility, and layer affinity.
minor comments (2)
  1. [Abstract] Ensure the GitHub link is explicitly stated in the abstract and methods for the claimed code release.
  2. [Method] Clarify notation for the probe weight matrix W, its SVD decomposition, and the resulting subspaces V_k to avoid ambiguity in the intervention procedure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and proposed revisions.

read point-by-point responses
  1. Referee: [Method / Abstract] The core identification assumption—that SVD on converged linear probe weights isolates low-rank subspaces specifically encoding explicit geometric signals rather than any sufficient statistic for the probe task—is untested and load-bearing for all three insights. Linear probes can exploit entangled or spurious correlations, and SVD yields an orthogonal basis ordered by probe variance without reference to original feature geometry or control tasks.

    Authors: We acknowledge the importance of validating that the subspaces capture geometric signals specifically. The manuscript supports this through cross-model comparisons (DINOv2 vs. MAE) showing distinct encoding patterns that align with known pre-training differences, and through intervention results where subspace removal degrades geometric tasks more than controls would predict. To strengthen against concerns of spurious correlations, we will add ablations with non-geometric control probes and random subspace baselines in the revision. revision: partial

  2. Referee: [Abstract] The abstract outlines the method and states three insights but includes no numerical results, validation experiments, or details on derivation; the full manuscript must supply quantitative evidence (e.g., performance deltas under subspace intervention, compressibility curves, layer-wise metrics) to support claims about alignment, compressibility, and layer affinity.

    Authors: We agree that the abstract would benefit from quantitative support. We will revise it to include key metrics such as compressibility ratios from SVD, layer-wise peak accuracies, and intervention performance deltas. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is empirical analysis of existing models

full rationale

The paper trains linear probes on ViT features for geometric tasks, decomposes converged probe weights W via SVD to obtain subspaces V_k, and reports observational patterns across DINOv2/MAE models and layers. These steps use standard techniques (linear probing + SVD) whose outputs are not equivalent to the inputs by construction, nor justified by self-citation chains. The three insights follow from measured performance differences after subspace interventions, which remain falsifiable against held-out data and control tasks. No load-bearing step reduces to a fitted parameter renamed as prediction or an ansatz smuggled via prior work by the same authors.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on the domain assumption that linear probing followed by SVD decomposition reveals the internal geometric encoding mechanics of the transformer features.

axioms (1)
  • domain assumption The weights of linear probes trained on ViT features encode geometric information in a way that SVD can isolate explicit low-rank subspaces.
    This is the core premise enabling the isolation of geometric signals.

pith-pipeline@v0.9.1-grok · 5743 in / 1203 out tokens · 36967 ms · 2026-07-03T15:33:12.736507+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

47 extracted references · 15 canonical work pages · 5 internal anchors

  1. [1]

    In: Annual Meeting of the Association for Computational Linguistics (2020)

    Aghajanyan, A., Zettlemoyer, L., Gupta, S.: Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In: Annual Meeting of the Association for Computational Linguistics (2020)

  2. [2]

    Understanding intermediate layers using linear classifier probes

    Alain, G., Bengio, Y.: Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644 (2016)

  3. [3]

    arXiv preprint arXiv:2601.09954 (2026)

    Alam, N., Murali, L.K., Bharadwaj, S., Liu, P., Chung, T., Sharma, D., Kiran, K., Tam, W., Vegesna, B.K.S., et al.: The spatial blindspot of vision-language models. arXiv preprint arXiv:2601.09954 (2026)

  4. [4]

    arXiv preprint arXiv:2112.058142(3), 4 (2021)

    Amir, S., Gandelsman, Y., Bagon, S., Dekel, T.: Deep vit features as dense visual descriptors. arXiv preprint arXiv:2112.058142(3), 4 (2021)

  5. [5]

    In: Neural Information Processing Systems (2019)

    Ansuini, A., Laio, A., Macke, J.H., Zoccolan, D.: Intrinsic dimension of data rep- resentations in deep neural networks. In: Neural Information Processing Systems (2019)

  6. [6]

    In: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision

    Bae, G., Budvytis, I., Cipolla, R.: Estimating and exploiting the aleatoric uncer- tainty in surface normal estimation. In: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision. pp. 13137–13146 (2021) 16 W. Zhou et al

  7. [7]

    arXiv preprint arXiv:2503.18762 (2025)

    Bahador, N.: Mechanistic interpretability of fine-tuned vision transformers on dis- torted images: Decoding attention head behavior for transparent and trustworthy ai. arXiv preprint arXiv:2503.18762 (2025)

  8. [8]

    2024 IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR) pp

    Banani, M.E., Raj, A., Maninis, K.K., Kar, A., Li, Y., Rubinstein, M., Sun, D., Guibas, L.J., Johnson, J., Jampani, V.: Probing the 3d awareness of visual founda- tion models. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR) pp. 21795–21806 (2024)

  9. [9]

    Compu- tational Linguistics48, 207–219 (2021)

    Belinkov, Y.: Probing classifiers: Promises, shortcomings, and advances. Compu- tational Linguistics48, 207–219 (2021)

  10. [10]

    2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp

    Bhat, S.F., Alhashim, I., Wonka, P.: Adabins: Depth estimation using adaptive bins. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 4008–4017 (2020)

  11. [11]

    7068–7077 (2021)

    Cao, J., Leng, H., Lischinski, D., Cohen-Or, D., Tu, C., Li, Y.: Shapeconv: Shape- awareconvolutionallayerforindoorrgb-dsemanticsegmentation.2021IEEE/CVF International Conference on Computer Vision (ICCV) pp. 7068–7077 (2021)

  12. [12]

    2021 IEEE/CVF International Conference on Computer Vision (ICCV) pp

    Caron, M., Touvron, H., Misra, I., J’egou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) pp. 9630–9640 (2021)

  13. [13]

    2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp

    Chen, X., Marks, M., Cheng, Z.: Probing the mid-level vision capabilities of self- supervised learning. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 30095–30105 (2024)

  14. [14]

    In: International conference on machine learning

    Dong, Y., Cordonnier, J.B., Loukas, A.: Attention is not all you need: Pure atten- tion loses rank doubly exponentially with depth. In: International conference on machine learning. pp. 2793–2803. PMLR (2021)

  15. [15]

    In: Neural Information Processing Systems (2014)

    Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: Neural Information Processing Systems (2014)

  16. [16]

    Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry

    Fel, T., Wang, B., Lepori, M.A., Kowal, M., Lee, A., Balestriero, R., Joseph, S., Lubana, E.S., Konkle, T., Ba, D., et al.: Into the rabbit hull: From task-relevant concepts in dino to minkowski geometry. arXiv preprint arXiv:2510.08638 (2025)

  17. [17]

    arXiv preprint arXiv:2410.23169 (2024)

    Garrod, C., Keating, J.P.: The persistence of neural collapse despite low-rank bias. arXiv preprint arXiv:2410.23169 (2024)

  18. [18]

    The international journal of robotics research32(11), 1231–1237 (2013)

    Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The kitti dataset. The international journal of robotics research32(11), 1231–1237 (2013)

  19. [19]

    In: 2012 IEEE conference on computer vision and pattern recognition

    Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? the kitti vision benchmark suite. In: 2012 IEEE conference on computer vision and pattern recognition. pp. 3354–3361. IEEE (2012)

  20. [20]

    2019 IEEE/CVF International Confer- ence on Computer Vision (ICCV) pp

    Goyal, P., Mahajan, D.K., Gupta, A.K., Misra, I.: Scaling and benchmarking self- supervised visual representation learning. 2019 IEEE/CVF International Confer- ence on Computer Vision (ICCV) pp. 6390–6399 (2019)

  21. [21]

    2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp

    He, K., Chen, X., Xie, S., Li, Y., Doll’ar, P., Girshick, R.B.: Masked autoencoders are scalable vision learners. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 15979–15988 (2021)

  22. [22]

    Hewitt, J., Liang, P.: Designing and interpreting probes with control tasks. In: Proceedings of the 2019 conference on empirical methods in natural language pro- cessing and the 9th international joint conference on natural language processing (emnlp-ijcnlp). pp. 2733–2743 (2019)

  23. [23]

    IEEE Transactions on Pattern Analysis and Machine Intelligence46, 2506–2517 (2022) Understanding ViTs via Subspace Intervention 17

    Huang, Z., Jin, X., Lu, C., Hou, Q., Cheng, M.M., Fu, D., Shen, X., Feng, J.: Contrastive masked autoencoders are stronger vision learners. IEEE Transactions on Pattern Analysis and Machine Intelligence46, 2506–2517 (2022) Understanding ViTs via Subspace Intervention 17

  24. [24]

    Advances in Neural Information Processing Systems36, 76061–76084 (2023)

    Jampani, V., Maninis, K.K., Engelhardt, A., Karpur, A., Truong, K., Sargent, K., Popov, S., Araujo, A., Martin Brualla, R., Patel, K., et al.: Navi: Category-agnostic image collections with high-quality 3d shape and pose annotations. Advances in Neural Information Processing Systems36, 76061–76084 (2023)

  25. [25]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Jevtić, A., Reich, C., Wimbauer, F., Hahn, O., Rupprecht, C., Roth, S., Cre- mers, D.: Feed-forward scenedino for unsupervised semantic scene completion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6784–6796 (2025)

  26. [26]

    arXiv preprint arXiv:2408.17059 (2024)

    Khan, A., Sohail, A., Fiaz, M., Hassan, M., Afridi, T.H., Marwat, S.U., Munir, F., Ali, S., Naseem, H., Zaheer, M.Z., et al.: A survey of the self supervised learning mechanisms for vision transformers. arXiv preprint arXiv:2408.17059 (2024)

  27. [27]

    arXiv preprint arXiv:2602.10099 (2026)

    Kumar, A., Patel, V.M.: Learning on the manifold: Unlocking standard diffusion transformers with representation encoders. arXiv preprint arXiv:2602.10099 (2026)

  28. [28]

    Measuring the Intrinsic Dimension of Objective Landscapes

    Li, C., Farkhoor, H., Liu, R., Yosinski, J.: Measuring the intrinsic dimension of objective landscapes. arXiv preprint arXiv:1804.08838 (2018)

  29. [29]

    Advances in Neural Information Processing Systems37, 76819–76847 (2024)

    Man, Y., Zheng, S., Bao, Z., Hebert, M., Gui, L., Wang, Y.X.: Lexicon3d: Probing visual foundation models for complex 3d scene understanding. Advances in Neural Information Processing Systems37, 76819–76847 (2024)

  30. [30]

    arXiv preprint arXiv:2403.05056 (2024)

    Mao, Y., Liu, J., Liu, X.: Stealing stable diffusion prior for robust monocular depth estimation. arXiv preprint arXiv:2403.05056 (2024)

  31. [31]

    DINOv2: Learning Robust Visual Features without Supervision

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

  32. [32]

    Proceedings of the National Academy of Sciences of the United States of America117, 24652 – 24663 (2020)

    Papyan, V., Han, X., Donoho, D.L.: Prevalence of neural collapse during the ter- minal phase of deep learning training. Proceedings of the National Academy of Sciences of the United States of America117, 24652 – 24663 (2020)

  33. [33]

    Park, N., Kim, W., Heo, B., Kim, T., Yun, S.: What do self-supervised vision transformers learn? arXiv preprint arXiv:2305.00729 (2023)

  34. [34]

    arXiv preprint arXiv:2410.07687 (2024)

    Patel, N., Shwartz-Ziv, R.: Learning to compress: Local rank and information com- pression in deep neural networks. arXiv preprint arXiv:2410.07687 (2024)

  35. [35]

    In: Annual Meeting of the Association for Computational Linguistics (2020)

    Pimentel, T., Valvoda, J., Maudslay, R.H., Zmigrod, R., Williams, A., Cotterell, R.: Information-theoretic probing for linguistic structure. In: Annual Meeting of the Association for Computational Linguistics (2020)

  36. [36]

    arXiv preprint arXiv:2104.08894 (2021)

    Pope, P., Zhu, C., Abdelkader, A., Goldblum, M., Goldstein, T.: The intrinsic dimension of images and its impact on learning. arXiv preprint arXiv:2104.08894 (2021)

  37. [37]

    2021 IEEE/CVF International Conference on Computer Vision (ICCV) pp

    Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) pp. 12159– 12168 (2021)

  38. [38]

    In: European Conference on Computer Vision (2012)

    Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from rgbd images. In: European Conference on Computer Vision (2012)

  39. [39]

    2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp

    Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotný, D.: Vggt: Visual geometry grounded transformer. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 5294–5306 (2025)

  40. [40]

    In: International conference on ma- chine learning

    Wang, T., Isola, P.: Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In: International conference on ma- chine learning. pp. 9929–9939. PMLR (2020)

  41. [41]

    2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp

    Xie, Z., Geng, Z., Hu, J., Zhang, Z., Hu, H., Cao, Y.: Revealing the dark secrets of masked image modeling. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 14475–14485 (2022) 18 W. Zhou et al

  42. [42]

    2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp

    Yang, L., Kang, B., Huang, Z., Xu, X., Feng, J., Zhao, H.: Depth anything: Un- leashing the power of large-scale unlabeled data. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 10371–10381 (2024)

  43. [43]

    Advances in Neural Information Processing Systems37, 21875–21911 (2024)

    Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., Zhao, H.: Depth anything v2. Advances in Neural Information Processing Systems37, 21875–21911 (2024)

  44. [44]

    Advances in Neural Information Pro- cessing Systems 37 (2023)

    Zhan, G., Zheng, C., Xie, W., Zisserman, A.: A general protocol to probe large vision models for 3d physical understanding. Advances in Neural Information Pro- cessing Systems 37 (2023)

  45. [45]

    2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp

    Zhang, J., Herrmann, C., Hur, J., Chen, E., Jampani, V., Sun, D., Yang, M.H.: Telling left from right: Identifying geometry-aware semantic correspondence. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 3076–3085 (2023)

  46. [46]

    Advances in Neural Information Processing Systems35, 27127–27139 (2022)

    Zhang, Q., Wang, Y., Wang, Y.: How mask matters: Towards theoretical under- standings of masked autoencoders. Advances in Neural Information Processing Systems35, 27127–27139 (2022)

  47. [47]

    iBOT: Image BERT Pre-Training with Online Tokenizer

    Zhou, J., Wei, C., Wang, H., Shen, W., Xie, C., Yuille, A., Kong, T.: ibot: Image bert pre-training with online tokenizer. arXiv preprint arXiv:2111.07832 (2021)