arxiv: 2605.29691 · v1 · pith:HNID4SK7new · submitted 2026-05-28 · 💻 cs.CV

Unsupervised Semantic Segmentation Facilitates Model Understanding

Pith reviewed 2026-06-29 08:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords self-supervised learning · vision transformers · unsupervised semantic segmentation · model understanding · positional bias · visualization protocol · locality bias

The pith

A visualization protocol using unsupervised semantic segmentation reveals consistent positional biases and scaling behaviors in self-supervised vision transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a simple visualization method based on unsupervised semantic segmentation applied to pretrained vision transformer representations. The goal is not improved segmentation accuracy but to make model-specific behaviors visible and comparable across different images and training approaches. Benchmarking multiple SSL models shows distinct positional biases, scaling patterns, and artifacts such as boundary effects in DINOv3-Large tokens. The approach also separates positional effects from locality bias in a direct visual manner. This protocol aims to communicate existing findings on model mechanics to a wider audience.

Core claim

Applying unsupervised semantic segmentation to the representations of various self-supervised vision transformers and visualizing the outputs conveys model behaviors that emerge consistently, including distinct positional biases and scaling behaviors, without any optimization for segmentation performance.

What carries the argument

The visualization protocol that renders unsupervised semantic segmentation results from model representations to expose biases and differences across layers.
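As a rough illustration of the idea, the sketch below clusters per-patch embeddings with plain k-means and reshapes the labels into a patch-grid map that could then be color-coded. It is not the authors' implementation: the reference list includes k-means (ref. [19]) and the Figure 9 caption mentions PCA-based centroid initialization, but the toy embeddings, the seeding from fixed indices, and the names `kmeans_labels` and `to_grid` are assumptions made for this example. The paper presumably fits clusters across a whole dataset, whereas this toy uses a single 2x3 grid.

```python
from math import dist


def kmeans_labels(points, init_indices, iters=20):
    """Plain Lloyd's k-means; returns one cluster label per point.

    init_indices picks the starting centroids (a hypothetical stand-in
    for whatever initialization a real pipeline would use).
    """
    centroids = [list(points[i]) for i in init_indices]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        # Update step: mean of assigned points (keep old centroid if empty).
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def to_grid(labels, width):
    """Reshape a flat label list into rows of `width` patches."""
    return [labels[r:r + width] for r in range(0, len(labels), width)]


# Toy 2x3 patch grid with 2-D embeddings (real ViT embeddings are much wider).
patch_embeddings = [
    (0.0, 0.1), (0.1, 0.0), (5.0, 5.1),
    (0.2, 0.1), (5.1, 4.9), (4.9, 5.0),
]
labels = kmeans_labels(patch_embeddings, init_indices=[0, 2])
segment_map = to_grid(labels, width=3)
```

Here `segment_map` is a grid of cluster ids; rendering each id as a color gives the kind of picture the protocol inspects for position-driven rather than content-driven structure.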

If this is right

  • Insights into distinct positional biases between contrastive and masked image modeling approaches become directly observable.
  • Strong boundary artifacts appear specifically in DINOv3-Large model tokens.
  • Positional effects can be visually separated from the related locality bias.
  • Previous analyses of self-attention and captured information gain an intuitive visual complement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The protocol could serve as a quick diagnostic when evaluating newly trained SSL models for unintended biases.
  • It may extend to comparing representations from supervised versus self-supervised training on the same architecture.
  • Visual patterns identified this way could guide targeted interventions to reduce specific artifacts during pretraining.

Load-bearing premise

That unsupervised semantic segmentation results, even when not optimized for segmentation accuracy, reliably surface consistent model behaviors that can be visually distinguished across images and training paradigms.

What would settle it

Running the protocol on the same set of models and images but obtaining visualizations with no repeatable differences between models or layers.

Figures

Figures reproduced from arXiv: 2605.29691 by Andreas Mardt, Dagmar Kainmüller, Jannik Franzen, Lisa Mais, Nick Lechtenbörger, Peter Hirsch, Xiaoyan Yu.

Figure 1
Overview of our protocol. We here introduce the individual elements of the protocol, and at the same time showcase exemplary insights it facilitates. To this end, across rows, we vary the SSL training paradigm at fixed ViT-B architecture and fixed ViT layer selection strategy by best mIoU. a) Qualitative elements: (a1) Joint visualization of unsupervised semantic segmentation results on exemplary sets of …
Figure 2
Canonical embeddings produced by a Vision Transformer: keys, queries, and values of the Multi-Head Attention (MHA) blocks, and tokens, defined as the feed-forward network (FFN) outputs for patch tokens. Each of these types of embeddings is at hand at each layer of the ViT. Regarding the aggregation of per-head embeddings stemming from the MHA block: For any given layer, we concatenate per-head embeddings…
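The per-head aggregation mentioned in this caption can be pictured with a small sketch. The caption is truncated, so the detail that concatenation happens along the feature dimension (giving one vector of length heads x head_dim per token) is an assumption, as are the function name `concat_heads` and the toy values.

```python
def concat_heads(per_head):
    """per_head[h][t] is the embedding of token t in attention head h.

    Returns one vector per token of length num_heads * head_dim,
    built by joining the per-head vectors in head order.
    """
    num_tokens = len(per_head[0])
    return [
        [x for head in per_head for x in head[t]]
        for t in range(num_tokens)
    ]


# Two heads, two tokens, head_dim = 2 (toy values).
keys_per_head = [
    [[1.0, 2.0], [3.0, 4.0]],  # head 0
    [[5.0, 6.0], [7.0, 8.0]],  # head 1
]
keys = concat_heads(keys_per_head)
```

The same helper would apply unchanged to queries and values, since they share the per-head layout.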
Figure 3
Performance of key, query, value, and token embeddings on the unsupervised semantic segmentation task across SSL models and model sizes. a) displays the layer-wise clustering performance based on key and token embeddings for four selected models. b) shows the overall performance across all models using their respective best-performing layer. Models highlighted in blue exhibit notably low segmentation per…
Figure 4
Model behavior across layers: Unsupervised semantic segmentation results for three exemplary models: a) DINO, b) MAE, and c) DINOv3. Top: Key embeddings: The three models exhibit distinct patterns of positional effect: it is confined to earlier layers in DINO, strongly emerges in intermediate layers in MAE, and persists strongly across all layers in DINOv3. Bottom: Token embeddings: Positional effect is l…
Figure 5
Behavior across scales: Clustering results for token embeddings from the best-performing layer of the Base and Large variants of DINOv3, iBOT, DINOv2 and DINOv2+reg. We observe a notably increased positional effect in DINOv3-L compared to its -B counterpart alongside inferior segmentation performance in terms of mIoU. DINOv2 exhibits a similar pattern. For iBOT, we observe notable positional effects across…
Figure 6
Positive scaling effect in iBOT: Unsupervised segmentation results for token embeddings from the best-performing layer of the Base and Large variants of iBOT focusing on animal images. We observe more fine-grained clustering of object parts for the larger model. For example, in the Large model, head regions of humans and different animals are assigned to more similar embeddings compared to the Base variant…
Figure 7
Locality Biases observable in Attention Matrices. a) Attention matrices for multiple models, averaged over attention heads. For improved visualization, maps were binarized w.r.t. their 95th percentile. b) Exemplary attention maps for two different queries, illustrating the stronger locality of attention in DINOv2+reg compared to DINO. c) Vertically “barcoded” attention patterns indicate zero mutual informa…
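The preprocessing described in panel a) (average over heads, then binarize at the 95th percentile) might look roughly like the sketch below. Whether the paper thresholds each map separately, and which percentile interpolation it uses, is not stated in the visible caption; the per-matrix threshold, the linear interpolation, and all function names here are assumptions.

```python
def mean_over_heads(attn):
    """attn[h][q][k] -> head-averaged attention matrix indexed [q][k]."""
    n_heads = len(attn)
    n_q, n_k = len(attn[0]), len(attn[0][0])
    return [
        [sum(attn[h][q][k] for h in range(n_heads)) / n_heads for k in range(n_k)]
        for q in range(n_q)
    ]


def percentile(values, pct):
    """Percentile with linear interpolation between closest ranks."""
    s = sorted(values)
    pos = (len(s) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def binarize(matrix, pct=95.0):
    """1 where an entry reaches the matrix-wide percentile, else 0."""
    thr = percentile([v for row in matrix for v in row], pct)
    return [[1 if v >= thr else 0 for v in row] for row in matrix]


# Toy attention: 2 heads, 2 queries, 2 keys; each row sums to 1.
attn = [
    [[0.75, 0.25], [0.5, 0.5]],   # head 0
    [[1.0, 0.0], [0.25, 0.75]],   # head 1
]
avg = mean_over_heads(attn)
mask = binarize(avg)
```

On real attention matrices with hundreds of patches, a strongly diagonal `mask` would be one visual cue of local attention.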
Figure 8
Left: Correlation between positional effect and segmentation performance (in log-scale) for all models and layers, across Keys, Queries, Values, and Tokens. Right: Correlation between positional effect and locality bias in Key and Query Embeddings.
Figure 9
Left: Impact of PCA-based centroid initialization on final inference performance for DINOv3 Base model. Results are reported across three different subsets of training patches used for PCA initialization. Right: Impact of batch size on final inference performance for DINOv3 Base model.
Figure 10
Performance of key, query, value and token embeddings on unsupervised semantic segmentation task across layers and model sizes. … a reference baseline exhibiting comparatively little positional bias. Figure 11b presents token embeddings from Base model variants, where segmentation artifacts are visible in both CLIP and DINOv2 results, consistent with findings reported in prior work [32]. Finally, Figure 11…
Figure 11
Additional examples of unsupervised semantic segmentation across …
Figure 12
Overclustering of dataset semantics into 97 clusters. Large variants of iBOT and DINOv2+reg exhibit finer-grained part-level clustering, separating regions such as eyes, ears, neck and trunk. In contrast, increasing the number of clusters does not lead to comparable part-level delineation in the Base variants.
Figure 13
Figure 13 replicates the cross-model comparison of key embeddings on the PascalPart (K = 6) and Cityscapes (K = 27) datasets. For both datasets, we adopt the same unsupervised segmentation training settings, as described in Sec. A.2. For PascalPart, we use only the animal subsets, resulting in approximately 1,077 training images and 1,094 test images.
Figure 14
Figure 14 presents the layer-wise performance on zero-shot semantic correspondence for both base and large models across all four embedding types. Panels: (a) key embedding, (b) query embedding, (c) value embedding, (d) token embedding.
Figure 15
Semantic correspondence qualitative results of token embeddings from three representative SSL ViT base models. We present three example image pairs to compare the qualitative performance of the best-performing layer and the final layer embeddings from (a) Supervised ViT, (b) CLIP, and (c) DINOv3. Heatmaps visualize the cosine similarity between a selected keypoint in the reference image (red dot) and all…
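A heatmap of this kind can be sketched as the cosine similarity between one reference embedding and every patch embedding of a target image, with the highest-response location taken as the predicted match. The caption is cut off, so comparing against all target patches is an assumption, and the names `similarity_heatmap` and `argmax_2d` and the toy vectors are illustrative only.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_heatmap(query, target_grid):
    """Similarity of the query embedding to each patch in a 2-D grid."""
    return [[cosine(query, emb) for emb in row] for row in target_grid]


def argmax_2d(heatmap):
    """(row, col) of the largest heatmap entry."""
    cells = ((r, c) for r, row in enumerate(heatmap) for c in range(len(row)))
    return max(cells, key=lambda rc: heatmap[rc[0]][rc[1]])


# Embedding at the selected keypoint of the reference image (toy, 2-D).
query = (1.0, 0.0)
# Patch embeddings of the target image arranged on a 2x2 grid.
target_grid = [
    [(0.0, 1.0), (1.0, 1.0)],
    [(2.0, 0.1), (-1.0, 0.0)],
]
heatmap = similarity_heatmap(query, target_grid)
best = argmax_2d(heatmap)
```

Because cosine similarity ignores vector length, the patch `(2.0, 0.1)` wins here despite having a larger norm than the query.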
Figure 16
Semantic correspondence results using supervised ViT (token), CLIP (token), and MAE (value) features. Green marks the ground-truth correspondence in the target image and the highest-response locations in the heatmaps. White circles indicate regions counted as true positives under the PCK metric. The best-performing layer is highlighted with a red rectangle. In the Large variants, supervised ViT feature …
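For readers unfamiliar with PCK (percentage of correct keypoints), a common formulation counts a predicted keypoint as correct when it lies within `alpha * max(height, width)` of the ground truth, where the reference size is an image or bounding-box size. The sketch below follows that common definition; the value `alpha=0.1`, the choice of reference size, and the toy coordinates are assumptions, since the visible caption does not specify the variant used.

```python
from math import hypot


def pck(pred, gt, ref_size, alpha=0.1):
    """Fraction of predictions within alpha * max(ref_size) of ground truth.

    pred, gt: lists of (x, y) keypoints in matching order.
    ref_size: (height, width) of the reference region (assumed here).
    """
    threshold = alpha * max(ref_size)
    hits = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if hypot(px - gx, py - gy) <= threshold
    )
    return hits / len(gt)


predicted = [(10, 10), (50, 52), (90, 40)]
ground_truth = [(12, 11), (50, 50), (60, 40)]
score = pck(predicted, ground_truth, ref_size=(100, 80))
```

With a threshold of 10 pixels, the first two predictions count as hits and the third, 30 pixels off, does not.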
Figure 17
Semantic correspondence results using Mugs (token) and iBOT (token) features. The iBOT large model captures finer semantic structures compared to the base model. Same notation as …
Figure 18
Semantic correspondence results using DINOv2 (token), DINOv2+reg (token) and DINOv3 (token) features. Same notation as …
Original abstract

Self-supervised learning (SSL) has produced a diverse landscape of vision transformers (ViTs) whose pretrained representations support a wide range of downstream tasks. Towards a better understanding of these models, a body of work has assessed the mechanics of their self-attention as well as the types of information captured across their representations, revealing, for example, stark differences between models trained with contrastive learning (CL) and masked image modeling (MIM). However, these advances in model understanding have not yet fully permeated the broader community, where insights specific to CL models are sometimes generalized to MIM models. To make model understanding straightforward and intuitive for a broad audience, we propose a simple and easily interpretable visualization protocol. Our protocol is based on visualizing unsupervised semantic segmentation results, yet our goal is not to maximize segmentation performance. Instead, it allows us to convey model behaviors that consistently emerge across images. Benchmarking a diverse set of SSL models across layers and representations, we obtain novel insights into distinct positional biases and scaling behaviors, including strong boundary artifacts in DINOv3-Large model tokens. These insights complement and help communicate a range of previous findings. Our protocol further enables a clear visual distinction between positional effects and the closely related but distinct locality bias, which has been studied much more extensively in the literature. The protocol is publicly available on GitHub and we believe it will catalyze further model understanding for a broad community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a simple visualization protocol based on unsupervised semantic segmentation outputs from self-supervised vision transformers. The goal is not to achieve high segmentation accuracy but to provide an intuitive way to convey consistent model behaviors (e.g., positional biases, scaling behaviors, and boundary artifacts such as those observed in DINOv3-Large) across images and training paradigms. The work benchmarks multiple SSL models (CL and MIM) across layers and representations, distinguishes positional effects from locality bias, and releases the protocol publicly on GitHub to aid broader model understanding.

Significance. If the visualizations reliably and consistently surface distinguishable behaviors that align with and communicate prior findings on CL vs. MIM differences, the protocol could serve as an accessible tool for the community. The explicit public code release strengthens reproducibility and potential adoption.

major comments (2)
  1. [Abstract] The central claim that the protocol yields 'novel insights' into positional biases, scaling behaviors, and boundary artifacts rests on qualitative observations, yet the provided text contains no quantitative metrics, consistency measures across images, dataset details, or error analysis to substantiate that the surfaced behaviors are reliable and not visualization artifacts.
  2. The assumption that unsupervised segmentation results (even when unoptimized for accuracy) reliably surface consistent, visually distinguishable model behaviors across training paradigms is load-bearing for the protocol's utility; without explicit validation (e.g., inter-image agreement statistics or comparison to known attention patterns), it remains unclear whether the method generalizes beyond selected examples.
minor comments (1)
  1. The distinction between positional effects and locality bias is stated as a contribution; the manuscript should include a dedicated paragraph or figure explicitly contrasting the two with references to prior locality-bias literature.
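One way to picture the "inter-image agreement statistics" requested in major comment 2 is a per-position agreement score: for each patch position, the share of images whose cluster label at that position equals the most common one, averaged over positions. This is a hypothetical statistic invented for illustration, not a measure taken from the paper, and it presupposes cluster ids that are shared across images (for example, from dataset-level clustering). Values near 1 would suggest labels driven by position rather than image content.

```python
from collections import Counter


def positional_agreement(label_maps):
    """Mean over patch positions of the modal-label frequency across images.

    label_maps: list of equally sized 2-D grids of cluster ids, one per image.
    """
    n_images = len(label_maps)
    rows, cols = len(label_maps[0]), len(label_maps[0][0])
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            counts = Counter(m[r][c] for m in label_maps)
            # Frequency of the most common label at this position.
            total += counts.most_common(1)[0][1] / n_images
    return total / (rows * cols)


# Every image gets the same layout: labels follow position, not content.
position_driven = [[[0, 1], [0, 1]]] * 3
# Layouts differ from image to image: labels follow content.
content_driven = [
    [[0, 0], [1, 1]],
    [[1, 0], [0, 1]],
    [[0, 1], [1, 0]],
]
high = positional_agreement(position_driven)
low = positional_agreement(content_driven)
```

A real analysis would also need a chance baseline, since a dominant background cluster alone can push the score up.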

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. Our protocol is intentionally qualitative and visualization-focused to make model behaviors accessible, rather than a quantitative benchmark. We address the concerns point-by-point below with targeted revisions.

  1. Referee: [Abstract] The central claim that the protocol yields 'novel insights' into positional biases, scaling behaviors, and boundary artifacts rests on qualitative observations, yet the provided text contains no quantitative metrics, consistency measures across images, dataset details, or error analysis to substantiate that the surfaced behaviors are reliable and not visualization artifacts.

    Authors: We agree the abstract should better contextualize the qualitative nature of the work. The protocol's purpose is to surface visually consistent behaviors across images for intuitive understanding, complementing (not replacing) quantitative studies in the literature. We will revise the abstract and add a dedicated paragraph in Section 3 to specify the evaluation images (sampled from ImageNet validation), note that consistency is demonstrated via repeated patterns across diverse examples in the figures and supplement, and explicitly state that no quantitative segmentation metrics or error analysis are claimed. This preserves the contribution's focus on accessibility while addressing the request for transparency.

    revision: partial

  2. Referee: The assumption that unsupervised segmentation results (even when unoptimized for accuracy) reliably surface consistent, visually distinguishable model behaviors across training paradigms is load-bearing for the protocol's utility; without explicit validation (e.g., inter-image agreement statistics or comparison to known attention patterns), it remains unclear whether the method generalizes beyond selected examples.

    Authors: The manuscript already benchmarks multiple SSL models (CL and MIM) across layers and shows behaviors that align with established differences (e.g., more global attention in CL vs. stronger locality in MIM). We will expand the discussion section to include explicit references to prior attention-map analyses that corroborate the observed positional biases and boundary artifacts, and add a short paragraph on generalization by noting that the released code enables verification on arbitrary images. We maintain that inter-image agreement statistics fall outside the protocol's scope, as it is not positioned as a validated segmentation method but as a visualization aid; adding such metrics would require a separate quantitative study.

    revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

The paper proposes a visualization protocol using unsupervised semantic segmentation outputs to convey consistent model behaviors in SSL ViTs, explicitly stating the goal is not segmentation accuracy. No equations, fitted parameters, predictions, or derivation chains appear in the abstract or description. Benchmarking across layers for positional biases and boundary artifacts is presented as an independent qualitative method with public code release. No self-citation load-bearing steps, self-definitional elements, or reductions of claims to inputs by construction are identifiable, rendering the protocol self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are specified or extractable.

Reference graph

Works this paper leans on

35 extracted references · 19 canonical work pages · 7 internal anchors

  1. [1]

    Amir, S., Gandelsman, Y., Bagon, S., Dekel, T.: Deep vit features as dense visual descriptors. arXiv preprint arXiv:2112.05814 (2021)

  2. [2]

    Balestriero, R., LeCun, Y.: Lejepa: Provable and scalable self-supervised learning without the heuristics (2025), https://arxiv.org/abs/2511.08544

  3. [3]

    Perception Encoder: The best visual embeddings are not at the output of the network

    Bolya, D., Huang, P.Y., Sun, P., Cho, J.H., Madotto, A., Wei, C., Ma, T., Zhi, J., Rajasegaran, J., Rasheed, H., et al.: Perception encoder: The best visual embeddings are not at the output of the network. arXiv preprint arXiv:2504.13181 (2025)

  4. [4]

    Caesar, H., Uijlings, J., Ferrari, V.: Coco-stuff: Thing and stuff classes in context. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1209–1218 (2018)

  5. [5]

    Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021)

  6. [6]

    Chen, X., Mottaghi, R., Liu, X., Fidler, S., Urtasun, R., Yuille, A.: Detect what you can: Detecting and representing objects using holistic models and body parts. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1971–1978 (2014)

  7. [7]

    Chen, X., Xie, S., He, K.: An empirical study of training self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9640–9649 (2021)

  8. [8]

    Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3213–3223 (2016)

  9. [9]

    Vision Transformers Need Registers

    Darcet, T., Oquab, M., Mairal, J., Bojanowski, P.: Vision transformers need registers. arXiv preprint arXiv:2309.16588 (2023)

  10. [10]

    Dehghani, M., Djolonga, J., Mustafa, B., Padlewski, P., Heek, J., Gilmer, J., Steiner, A., Caron, M., Geirhos, R., Alabdulmohsin, I., Jenatton, R., Beyer, L., Tschannen, M., Arnab, A., Wang, X., Riquelme, C., Minderer, M., Puigcerver, J., Evci, U., Kumar, M., van Steenkiste, S., Elsayed, G.F., Mahendran, A., Yu, F., Oliver, A., Huot, F., Bastings, J., Col...

  11. [11]

    Doshi, F.R., Fel, T., Konkle, T., Alvarez, G.: Bi-orthogonal factor decomposition for vision transformers (2026), https://arxiv.org/abs/2601.05328

  12. [12]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

  13. [13]

    Hahn, O., Araslanov, N., Schaub-Meyer, S., Roth, S.: Boosting unsupervised semantic segmentation with principal mask proposals. arXiv preprint arXiv:2404.16818 (2024)

  14. [14]

    Hamilton, M., Zhang, Z., Hariharan, B., Snavely, N., Freeman, W.T.: Unsupervised semantic segmentation by distilling feature correspondences. arXiv preprint arXiv:2203.08414 (2022)

  15. [15]

    He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16000–16009 (2022)

  16. [16]

    Ilharco, G., Wortsman, M., Wightman, R., Gordon, C., Carlini, N., Taori, R., Dave, A., Shankar, V., Namkoong, H., Miller, J., et al.: Openclip (2021)

  17. [17]

    Li, Y., Salehi, S., Ungar, L., Kording, K.P.: Does object binding naturally emerge in large pretrained vision transformers? (2026), https://arxiv.org/abs/2510.24709

  18. [18]

    Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 10012–10022 (2021)

  19. [19]

    MacQueen, J.B.: Some methods of classification and analysis of multivariate observations. In: Proc. of 5th Berkeley Symposium on Math. Stat. and Prob. pp. 281–297 (1967)

  20. [20]

    DINOv2: Learning Robust Visual Features without Supervision

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

  21. [21]

    Park, N., Kim, W., Heo, B., Kim, T., Yun, S.: What do self-supervised vision transformers learn? (2023), https://arxiv.org/abs/2305.00729

  22. [22]

    Seong, H.S., Moon, W., Lee, S., Heo, J.P.: Leveraging hidden positives for unsupervised semantic segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 19540–19549 (2023)

  23. [23]

    Shannon, C.E.: A mathematical theory of communication. The Bell System Technical Journal 27, 379–423 (1948), http://plan9.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf

  24. [24]

    DINOv3

    Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: Dinov3. arXiv preprint arXiv:2508.10104 (2025)

  25. [25]

    Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y.: Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 568, 127063 (2024)

  26. [26]

    Walmer, M., Suri, S., Gupta, K., Shrivastava, A.: Teaching matters: Investigating the role of supervision in vision transformers (2023), https://arxiv.org/abs/2212.03862

  27. [27]

    Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L.: Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 568–578 (2021)

  28. [28]

    Wang, X., Girdhar, R., Yu, S.X., Misra, I.: Cut and learn for unsupervised object detection and instance segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3124–3134 (2023)

  29. [29]

    Wang, Y., Shen, X., Yuan, Y., Du, Y., Li, M., Hu, S.X., Crowley, J.L., Vaufreydaz, D.: Tokencut: Segmenting objects in images and videos with self-supervised transformer and normalized cut. IEEE transactions on pattern analysis and machine intelligence 45(12), 15790–15801 (2023)

  30. [30]

    Wu, Z., Zhang, J., Pai, D., Wang, X., Singh, C., Yang, J., Gao, J., Ma, Y.: Simplifying dino via coding rate regularization. arXiv preprint arXiv:2502.10385 (2025)

  31. [31]

    Xie, Z., Geng, Z., Hu, J., Zhang, Z., Hu, H., Cao, Y.: Revealing the dark secrets of masked image modeling. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 14475–14485 (2023). https://doi.org/10.1109/CVPR52729.2023.01391

  32. [32]

    Yang, J., Luo, K.Z., Li, J., Deng, C., Guibas, L., Krishnan, D., Weinberger, K.Q., Tian, Y., Wang, Y.: Denoising vision transformers. In: European Conference on Computer Vision. pp. 453–469. Springer (2024)

  33. [33]

    iBOT: Image BERT Pre-Training with Online Tokenizer

    Zhou, J., Wei, C., Wang, H., Shen, W., Xie, C., Yuille, A., Kong, T.: ibot: Image bert pre-training with online tokenizer. arXiv preprint arXiv:2111.07832 (2021)

  34. [34]

    Zhou, P., Zhou, Y., Si, C., Yu, W., Ng, T.K., Yan, S.: Mugs: A multi-granular self-supervised learning framework. arXiv preprint arXiv:2203.14415 (2022)

  35. [35]

    Zhou, T., Xia, W., Zhang, F., Chang, B., Wang, W., Yuan, Y., Konukoglu, E., Cremers, D.: Image segmentation in foundation model era: A survey. arXiv preprint arXiv:2408.12957 (2024)

A Supplement

A.1 Properties of models included in the benchmark

    SSL framework | Size | Dataset | Pos. Emb. | Category
    Supervised ViT [12] | B, L | ImageNet-21K | Learned abs | Supervised ...