Learning Subspace-Preserving Sparse Attention Graphs from Heterogeneous Multiview Data

Chuanbin Liu; Jie Chen; Xi Peng; Yuanbiao Gou; Zhu Wang

arxiv: 2605.11881 · v2 · pith:AQHORQN4new · submitted 2026-05-12 · 💻 cs.CV

Pith reviewed 2026-05-20 22:43 UTC · model grok-4.3

classification 💻 cs.CV

keywords sparse attention graphs · heterogeneous multiview data · subspace preservation · unsupervised transfer learning · bilinear attention · alpha-entmax · graph learning

The pith

SAGL learns subspace-preserving sparse attention graphs from heterogeneous multiview data using bilinear factorization and alpha-entmax.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method to construct sparse similarity graphs from high-dimensional features produced by different pretrained models. These graphs are designed to preserve the underlying subspace structures present in the data. This matters because current unsupervised transfer learning techniques often fail to fully exploit complementary information across views for semantic alignment. By addressing this, the approach aims to produce better representations for downstream multiview tasks.

Core claim

We propose SAGL, which uses a bilinear attention factorization to capture asymmetric similarities among features, a dynamic sparsity gating to adaptively control neighbor contributions via a feature-specific compression factor, and α-entmax for generating subspace-preserving sparse attention graphs per view. These graphs then support sparse information aggregation to yield discriminative representations. A theoretical analysis connects differentiable sparse attention to probability simplex constraints.
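A minimal sketch of how the three components named above could fit together, assuming NumPy. The weight names `Wq`, `Wk`, and `w_gate`, the sigmoid form of the gate, and the masking of self-loops are illustrative assumptions, not the paper's formulation. Sparsemax (the α = 2 case of α-entmax) stands in for the general projection.

```python
import numpy as np

def sparsemax(scores):
    """Row-wise sparsemax: Euclidean projection of each row onto the simplex."""
    z = np.asarray(scores, dtype=float)
    n, d = z.shape
    z_sorted = -np.sort(-z, axis=1)              # each row in descending order
    cssv = np.cumsum(z_sorted, axis=1) - 1.0     # cumulative sums minus one
    k = np.arange(1, d + 1)
    in_support = z_sorted * k > cssv             # prefix of entries that stay nonzero
    k_star = in_support.sum(axis=1)              # support size per row (at least 1)
    tau = cssv[np.arange(n), k_star - 1] / k_star
    return np.maximum(z - tau[:, None], 0.0)

def sparse_attention_graph(X, Wq, Wk, w_gate):
    """Hypothetical pipeline: bilinear scores -> per-row gate -> sparse projection."""
    Q, K = X @ Wq, X @ Wk                        # separate projections: scores need not be symmetric
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    gate = 1.0 / (1.0 + np.exp(-(X @ w_gate)))   # assumed per-feature factor in (0, 1)
    scores = scores * gate[:, None]              # smaller gate shrinks score gaps, giving denser rows
    np.fill_diagonal(scores, -1e9)               # assumption: no self-loops in the graph
    A = sparsemax(scores)                        # each row lies on the probability simplex
    Z = A @ X                                    # sparse aggregation over selected neighbors
    return A, Z

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                      # toy features, not real multiview data
Wq = rng.normal(size=(4, 3))
Wk = rng.normal(size=(4, 3))
w_gate = rng.normal(size=4)
A, Z = sparse_attention_graph(X, Wq, Wk, w_gate)
```

Each row of `A` is non-negative and sums to one, so every row of `Z` is a convex combination of the other features. Nothing in this sketch guarantees that the selected neighbors share a subspace; that is the property the paper claims for its learned graphs.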

What carries the argument

Bilinear attention factorization scheme with dynamic sparsity gating and α-entmax structured sparse projection, which breaks symmetry in similarities and enforces subspace preservation in the learned graphs.
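The symmetry point can be checked numerically. The sketch below, assuming NumPy and toy random data, compares a plain inner-product similarity with a bilinear form `x_i^T W x_j`; the matrix `W` is a hypothetical unconstrained weight, not a quantity taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))        # 5 toy features of dimension 8
W = rng.normal(size=(8, 8))        # hypothetical bilinear weight, not symmetric in general

gram = X @ X.T                     # inner-product similarity: s_ij equals s_ji
bilinear = X @ W @ X.T             # bilinear similarity: s_ij = x_i^T W x_j

sym_gap_gram = np.abs(gram - gram.T).max()
sym_gap_bilinear = np.abs(bilinear - bilinear.T).max()
```

With a symmetric `W` the bilinear form would be symmetric again, so the asymmetry comes from leaving `W` (or its two factors) unconstrained.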

If this is right

The learned graphs enable effective sparse information aggregation across views.
Discriminative representations are produced for various multiview learning tasks.
Theoretical guarantees link the sparse attention mechanism to simplex constraints.
The method achieves superior performance over state-of-the-art approaches on benchmark datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This framework may extend to other modalities like text or audio where multiple pretrained models provide views.
Breaking the symmetry bottleneck could inspire similar techniques in graph neural networks for non-symmetric relations.
The dynamic gating might help in scenarios with noisy or varying quality features from different models.

Load-bearing premise

The bilinear attention factorization combined with dynamic sparsity gating and α-entmax projection faithfully recovers the intrinsic subspace structures from the high-dimensional heterogeneous multiview features.

What would settle it

Evaluating the method on synthetic multiview data with explicitly defined subspaces and measuring whether the learned graphs exhibit higher fidelity to those subspaces than competing methods would falsify or support the claim.
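One way such a test could be set up is sketched below, assuming NumPy. The generator, the metric name `subspace_preserving_mass`, and all sizes are hypothetical choices for illustration; the paper's own evaluation protocol is not described in the material above.

```python
import numpy as np

def sample_union_of_subspaces(n_subspaces=3, dim=2, ambient=20, per=30, seed=0):
    """Draw points from randomly oriented low-dimensional subspaces."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for c in range(n_subspaces):
        # Orthonormal basis (ambient x dim) of a random subspace.
        basis, _ = np.linalg.qr(rng.normal(size=(ambient, dim)))
        coeffs = rng.normal(size=(per, dim))
        X.append(coeffs @ basis.T)
        y.append(np.full(per, c))
    return np.vstack(X), np.concatenate(y)

def subspace_preserving_mass(A, y):
    """Fraction of total edge weight that connects points from the same subspace."""
    A = np.abs(np.asarray(A, dtype=float))
    same = y[:, None] == y[None, :]
    total = A.sum()
    return float(A[same].sum() / total) if total > 0 else 0.0

X, y = sample_union_of_subspaces()
oracle = (y[:, None] == y[None, :]).astype(float)   # ideal block-diagonal graph
uniform = np.ones((len(y), len(y)))                 # graph with no structure
score_oracle = subspace_preserving_mass(oracle, y)
score_uniform = subspace_preserving_mass(uniform, y)
```

A learned graph would be scored the same way and compared against both reference points: the oracle graph attains 1.0, and the uniform graph gives the chance level for three equally sized subspaces.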

Figures

Figures reproduced from arXiv: 2605.11881 by Chuanbin Liu, Jie Chen, Xi Peng, Yuanbiao Gou, Zhu Wang.

**Figure 1.** A comparison of performance between SAGL and the zero-shot transfer learning baselines. The zero-shot transfer learning baselines, such as CLIP zero-shot transfer (ZST), LaFTer, and SAOT, utilize descriptions of ground-truth classes as a form of supervision. CLIP ZST and LaFTer employ CLIP ViT-L/14 and ViT-B/32 backbones, respectively, while SAOT employs CLIP ViT-B/16 and D…

**Figure 2.** Training times (in seconds) of TURTLE, MSRL, and SAGL on eight vision datasets. Experimental details: The statistics of the datasets are summarized in Table V. All experiments are performed on a Linux workstation equipped with a GeForce RTX 4090 GPU (24 GB memory), an Intel Xeon Platinum 8336C CPU, and 128 GB of RAM. SAGL is implemented in PyTorch [30]. Parameter settings: During both training and t…

**Figure 3.** The t-SNE visualization of three levels of features generated by SAGL on the Pets dataset, where the original features are extracted using SigLIP 2. (a) The original features (b) The linear features (c) The corresponding representations

**Figure 4.** The t-SNE visualization of three levels of features generated by SAGL on the Pets dataset, where the original features are extracted using DINOv3. (a) The original features (b) The linear features (c) The corresponding representations

**Figure 5.** The t-SNE visualization of three levels of features generated by SAGL on the SUN397 dataset, where the original features are extracted using SigLIP 2.

**Figure 6.** The t-SNE visualization of three levels of features generated by SAGL on the SUN397 dataset, where the original features are extracted using DINOv3. (a) The original features (b) The linear features (c) The reconstructed representations

**Figure 7.** The t-SNE visualization of three levels of features generated by SAGL on the Food101 dataset, where the original features are extracted using SigLIP 2. (a) The original features (b) The linear features (c) The reconstructed representations

**Figure 8.** The t-SNE visualization of three levels of features generated by SAGL on the Food101 dataset, where the original features are extracted using DINOv3.

**Figure 9.** Sparsity ratio evolution of sparse attention graphs during training. (a) Caltech101 View 1 (b) Caltech101 View 2 (c) Food101 View 1 (d) Food101 View 2

**Figure 10.** Block-diagonal structures of the learned sparse attention graphs on the Caltech101 and Food101 datasets. (a) ACC (b) NMI (c) ARI

**Figure 11.** Clustering results of SAGL on the Pets dataset across different batch sizes. (a) ACC (b) NMI (c) ARI

**Figure 12.** Clustering results of SAGL on the SUN397 dataset across different batch sizes.

**Figure 13.** Clustering results of SAGL on the Food101 dataset across different batch sizes. (a) ACC (b) NMI (c) ARI

**Figure 14.** The clustering results of SAGL under different γ and β combinations on the Pets dataset. (a) ACC (b) NMI (c) ARI

**Figure 15.** The clustering results of SAGL under different γ and β combinations on the SUN397 dataset.

**Figure 16.** The clustering results of SAGL under different γ and β combinations on the Food101 dataset. (a) Pets (b) KITTI (c) Flowers (d) Caltech101 (e) EuroSAT (f) SUN397 (g) Food101 (h) ImageNet-1K

**Figure 17.** Convergence results of Algorithm 1 on all the datasets. Sparsity analysis on sparse attention graphs: We first analyze the sparsity of the learned attention graphs during training. The sparsity ratio (SR) is defined as the number of nonzero elements in $A^{(l)}$ divided by the total number of elements. Specifically, we examine the sparsity ratios of sparse attention grap…
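The sparsity ratio quoted in the Figure 17 text can be computed directly from a graph matrix. The sketch below assumes NumPy; the function name and the optional tolerance `tol` are illustrative additions, not part of the paper.

```python
import numpy as np

def sparsity_ratio(A, tol=0.0):
    """Number of nonzero entries divided by the total number of entries."""
    A = np.asarray(A)
    return np.count_nonzero(np.abs(A) > tol) / A.size

# Toy row-stochastic graph with 4 nonzero entries out of 9.
A_toy = np.array([[0.5, 0.5, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
sr = sparsity_ratio(A_toy)
```

Under this definition a smaller SR means a sparser graph, since the ratio counts the entries that remain nonzero.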

Original abstract

The high-dimensional features extracted from large-scale unlabeled data via various pretrained models with diverse architectures are referred to as heterogeneous multiview data. Most existing unsupervised transfer learning methods fail to faithfully recover intrinsic subspace structures when exploiting complementary information across multiple views. Therefore, a fundamental challenge involves constructing sparse similarity graphs that preserve these underlying subspace structures for achieving semantic alignment across heterogeneous views. In this paper, we propose a sparse attention graph learning (SAGL) method that learns subspace-preserving sparse attention graphs from heterogeneous multiview data. Specifically, we introduce a bilinear attention factorization scheme to capture asymmetric similarities among the high-dimensional features, which breaks the symmetry bottleneck that is inherent in the traditional representation learning techniques. A dynamic sparsity gating mechanism then predicts a feature-specific compression factor for adaptively controlling the topological contributions of neighbors. Furthermore, we employ a structured sparse projection via $\alpha$-entmax to generate subspace-preserving sparse attention graphs for individual views. SAGL leverages these view-specific graphs to conduct sparse information aggregation, yielding discriminative representations for multiview learning tasks. In addition, we provide a rigorous theoretical analysis that bridges differentiable sparse attention and probability simplex constraints. Extensive experiments conducted on multiple benchmark datasets demonstrate that SAGL consistently outperforms the state-of-the-art unsupervised transfer learning approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

SAGL combines bilinear factorization with α-entmax sparsity for multiview graphs and reports benchmark gains, but the subspace recovery claim under heterogeneous pretrained features rests more on construction than on derived conditions.

The paper introduces SAGL, which uses bilinear attention factorization to handle asymmetric similarities across views, adds dynamic sparsity gating to pick a per-feature compression factor, and applies α-entmax projection to produce sparse graphs that are meant to stay subspace-preserving. Those pieces are stitched together to aggregate information and produce representations for unsupervised transfer tasks. The experiments show the method beating prior approaches on several standard multiview benchmarks, which is the clearest practical signal here. The dynamic gating and the shift to asymmetric similarities are the parts that feel like genuine additions rather than simple re-labeling of older attention tricks. The theoretical section claims to bridge sparse attention with simplex constraints, and that framing is at least coherent on its own terms. The main limitation is that the analysis does not appear to supply explicit conditions on view heterogeneity or pretrained-model shift under which the recovered graphs actually match intrinsic subspaces. Without those conditions or a clear counter-example study, the subspace-preservation property stays tied to the modeling choices rather than following from the math in a verifiable way. The experiments would need to include stronger ablations on the new components to show how much they drive the reported gains versus the baseline attention machinery. This work is aimed at people already working on multiview graph methods in computer vision who need a concrete way to handle features from multiple pretrained extractors. It is worth sending for peer review because the method is testable and the empirical claims can be checked directly, even if the theory section would benefit from tighter bounds.

Referee Report

2 major / 2 minor

Summary. The paper proposes SAGL, a sparse attention graph learning method for constructing subspace-preserving sparse attention graphs from heterogeneous multiview data extracted via diverse pretrained models. It introduces bilinear attention factorization to model asymmetric similarities, a dynamic sparsity gating mechanism that predicts feature-specific compression factors, and an α-entmax structured sparse projection to enforce subspace-preserving graphs per view. These graphs enable sparse information aggregation for discriminative multiview representations. The work includes a claimed rigorous theoretical analysis bridging differentiable sparse attention with probability simplex constraints and reports consistent outperformance over state-of-the-art unsupervised transfer learning methods on multiple benchmark datasets.

Significance. If the central claims hold, the work could advance unsupervised multiview and transfer learning by providing a principled mechanism to recover intrinsic subspace structures from high-dimensional heterogeneous features. The bilinear factorization and α-entmax components offer a novel link between attention mechanisms and graph-based subspace preservation, with potential for broader application in semantic alignment tasks. Empirical outperformance is noted as a strength, though significance hinges on verifying the theoretical guarantees for subspace fidelity under pretrained-model distribution shifts.

major comments (2)

[Theoretical Analysis] Theoretical Analysis section: The claimed rigorous bridge between differentiable sparse attention and probability simplex constraints does not derive or state explicit conditions (e.g., bounds on feature dimensionality, view heterogeneity, or pretrained feature distribution shift) under which the resulting graphs provably recover intrinsic subspaces rather than merely satisfying simplex membership. This is load-bearing for the central claim of faithful subspace structure recovery.
[Section 3.2 and 3.3] Section 3.2 (Bilinear Attention Factorization) and Section 3.3 (Dynamic Sparsity Gating): The construction is presented as breaking symmetry bottlenecks and adaptively controlling topology, but the manuscript provides no derivation or counter-example analysis showing that the combination with α-entmax guarantees subspace preservation (as opposed to generic sparsity) when input features come from heterogeneous pretrained models with potential distribution shift.

minor comments (2)

[Abstract and Introduction] The abstract and introduction are information-dense; consider adding a short overview paragraph or diagram that explicitly maps the three proposed components to the claimed theoretical and empirical contributions.
[Method sections] Notation for the feature-specific compression factor and the α-entmax projection could be clarified with an explicit equation reference in the main text to improve readability for readers unfamiliar with entmax variants.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below with clarifications and indicate planned revisions to strengthen the presentation of the theoretical and methodological contributions.

Referee: [Theoretical Analysis] Theoretical Analysis section: The claimed rigorous bridge between differentiable sparse attention and probability simplex constraints does not derive or state explicit conditions (e.g., bounds on feature dimensionality, view heterogeneity, or pretrained feature distribution shift) under which the resulting graphs provably recover intrinsic subspaces rather than merely satisfying simplex membership. This is load-bearing for the central claim of faithful subspace structure recovery.

Authors: We appreciate the referee's point that the theoretical analysis centers on establishing the connection to simplex constraints via the differentiable α-entmax projection but stops short of deriving explicit bounds or conditions guaranteeing intrinsic subspace recovery under arbitrary feature dimensionality, view heterogeneity, or pretrained-model distribution shifts. The analysis demonstrates that the projection enforces non-negativity and summation to one, which aligns with the convex-combination property used in subspace clustering. In the revised manuscript we will expand the theoretical section with an additional remark clarifying the modeling assumptions (e.g., that input features approximately lie in a union of subspaces) under which simplex membership supports subspace preservation, while explicitly acknowledging the absence of worst-case bounds for strong distribution shifts. This addition will also reference relevant subspace-clustering literature to contextualize the scope of the guarantees. revision: partial
Referee: [Section 3.2 and 3.3] Section 3.2 (Bilinear Attention Factorization) and Section 3.3 (Dynamic Sparsity Gating): The construction is presented as breaking symmetry bottlenecks and adaptively controlling topology, but the manuscript provides no derivation or counter-example analysis showing that the combination with α-entmax guarantees subspace preservation (as opposed to generic sparsity) when input features come from heterogeneous pretrained models with potential distribution shift.

Authors: We agree that Sections 3.2 and 3.3 describe the bilinear factorization (to capture asymmetric similarities) and dynamic gating (to predict per-feature compression) without a self-contained derivation or counter-example study proving that their combination with α-entmax yields subspace-preserving graphs rather than merely sparse simplex vectors, especially under distribution shifts from heterogeneous pretrained models. The design choices are motivated by the need to relax symmetry and adapt sparsity to local feature statistics, with α-entmax supplying the structured sparsity. In the revision we will insert a short proposition in Section 3 that formally links the three components to subspace preservation under the assumption that the pretrained features are approximately subspace-structured, and we will add a brief discussion of robustness to moderate shifts as observed in the experiments. A full counter-example analysis across all possible shifts is beyond the current scope but will be noted as a limitation. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation relies on novel constructions and direct projection properties

The paper proposes SAGL via three new mechanisms—bilinear attention factorization for asymmetric similarities, dynamic sparsity gating for feature-specific compression, and α-entmax structured sparse projection—then aggregates the resulting view-specific graphs. The claimed 'rigorous theoretical analysis that bridges differentiable sparse attention and probability simplex constraints' follows directly from the known properties of the entmax operator enforcing simplex membership; this is a definitional consequence of the chosen projection rather than a reduction of the subspace-recovery claim to fitted parameters or prior self-citations. No equations or steps in the provided description equate a prediction or uniqueness result to its own inputs by construction, and the central subspace-preservation claim is presented as an empirical modeling outcome validated on benchmarks rather than a tautology. The derivation chain therefore remains self-contained against external data.
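The simplex-membership point can be made concrete with the standard definition of α-entmax from the sparse sequence-to-sequence literature [21]. This is the textbook form; whether SAGL uses exactly this parameterization, and how the compression factor enters the score vector $z$, is not stated in the material above.

```latex
\begin{align}
\Delta^{d-1} &= \{\, p \in \mathbb{R}^{d} : p \ge 0,\ \mathbf{1}^{\top} p = 1 \,\}, \\
\alpha\text{-entmax}(z) &= \arg\max_{p \in \Delta^{d-1}} \; p^{\top} z + H^{T}_{\alpha}(p),
\qquad
H^{T}_{\alpha}(p) = \frac{1}{\alpha(\alpha - 1)} \sum_{j=1}^{d} \bigl( p_j - p_j^{\alpha} \bigr), \quad \alpha \neq 1, \\
\alpha\text{-entmax}(z) &= \bigl[ (\alpha - 1)\, z - \tau \mathbf{1} \bigr]_{+}^{1/(\alpha - 1)}, \quad \alpha > 1,
\end{align}
```

Here $\tau$ is the unique threshold that makes the entries sum to one, and $[\cdot]_{+}$ clips negative values to zero. Non-negativity and unit sum hold by construction for any input $z$. The definition says nothing about which entries stay nonzero, which is why simplex membership alone does not establish subspace preservation.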

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that heterogeneous multiview features contain recoverable intrinsic subspace structures and that the proposed mechanisms can preserve them without post-hoc tuning details provided.

free parameters (2)

alpha for entmax: Parameter controlling the structured sparse projection; it likely requires selection or fitting, though this is not explicitly detailed in the abstract.
feature-specific compression factor: Predicted by the dynamic sparsity gating mechanism; it controls the topological contributions of neighbors.
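To illustrate how the first free parameter shapes sparsity, the sketch below computes α-entmax for a single score vector by bisection on the threshold, assuming NumPy and α > 1. It is a generic solver written for this note, not the paper's implementation, and the score vector is made up.

```python
import numpy as np

def entmax_bisect(z, alpha=1.5, n_iter=60):
    """alpha-entmax of a 1-D score vector via bisection on the threshold tau."""
    z = (alpha - 1.0) * np.asarray(z, dtype=float)
    d = z.size
    expo = 1.0 / (alpha - 1.0)
    # p(tau) = clip(z - tau, 0) ** expo, and its sum decreases as tau grows.
    lo = z.max() - 1.0                           # at this tau the sum is >= 1
    hi = z.max() - (1.0 / d) ** (alpha - 1.0)    # at this tau the sum is <= 1
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        total = np.sum(np.clip(z - tau, 0.0, None) ** expo)
        if total >= 1.0:
            lo = tau
        else:
            hi = tau
    p = np.clip(z - lo, 0.0, None) ** expo
    return p / p.sum()                           # remove the residual bisection error

scores = np.array([1.0, 0.8, 0.1, -1.0])         # hypothetical attention scores
p_dense = entmax_bisect(scores, alpha=1.05)      # close to softmax
p_mid = entmax_bisect(scores, alpha=1.5)
p_sparse = entmax_bisect(scores, alpha=2.0)      # equals sparsemax
```

On this toy vector, raising α from 1.05 to 2 reduces the number of nonzero entries, while every output stays on the probability simplex. A feature-specific compression factor that rescales the scores before the projection would shift the same trade-off per row.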

axioms (1)

domain assumption: High-dimensional features from diverse pretrained models contain intrinsic subspace structures that can be recovered via sparse similarity graphs.
Invoked in the problem formulation and motivation for constructing subspace-preserving graphs.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation to the paper passage: unclear

We employ a structured sparse projection via α-entmax to generate subspace-preserving sparse attention graphs... rigorous theoretical analysis that bridges differentiable sparse attention and probability simplex constraints.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages

[1] B. Alkin, L. Miklautz, S. Hochreiter, and J. Brandstetter, “MIM-Refiner: A contrastive learning boost from intermediate pre-trained representations,” in Proc. 13th Int. Conf. Learn. Represent., Singapore, Apr. 2025, pp. 1–37.
[2] T. Chu, S. Tong, T. Ding, X. Dai, B. D. Haeffele, R. Vidal, and Y. Ma, “Image clustering via the principle of rate reduction in the age of pretrained models,” in Proc. 12th Int. Conf. Learn. Represent., Vienna, Austria, May 2024, pp. 1–12.
[3] M. Brbić and I. Kopriva, “l0-motivated low-rank sparse subspace clustering,” IEEE Trans. Cybern., vol. 50, no. 4, pp. 1711–1725, 2020.
[4] I. Ng, Y. Li, Z. Li, Y. Zheng, G. Chen, and K. Zhang, “A general representation-based approach to multi-source domain adaptation,” in Proc. 42nd Int. Conf. Mach. Learn., Vancouver, Canada, Jul. 2025, pp. 45 911–45 933.
[5] J. Wang, J. Li, W. Zhuang, C. Chen, L. Lyu, and F. Ma, “Enhancing foundation models with federated domain knowledge infusion,” in Proc. 42nd Int. Conf. Mach. Learn., Vancouver, Canada, Jul. 2025, pp. 63 621–63 635.
[6] Z. Chen, Y. Deng, Y. Li, and Q. Gu, “Understanding transferable representation learning and zero-shot transfer in CLIP,” in Proc. 12th Int. Conf. Learn. Represent., Vienna, Austria, May 2024, pp. 1–12.
[7] J. Chen, Z. Wang, C. Liu, and X. Peng, “Multiview self-representation learning across heterogeneous views,” arXiv preprint arXiv:2602.04328, pp. 1–12, Jan. 2026.
[8] C. Liu, R. Bing, X. Xi, W. Dai, and G. Yuan, “Heterogeneous graph structure learning for experts selection in academic evaluation,” IEEE Trans. Comput. Soc. Syst., vol. 12, no. 6, pp. 4677–4688, 2025.
[9] A. Gadetsky, Y. Jiang, and M. Brbić, “Let go of your labels with unsupervised transfer,” in Proc. 41st Int. Conf. Mach. Learn., Vienna, Austria, Jul. 2024, pp. 14 382–14 407.
[10] M. Fu, K. Zhu, Z. Ding, and J. Wu, “DTL: Parameter- and memory-efficient disentangled vision learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 48, no. 2, pp. 1736–1749, Feb. 2026.
[11] C. Shang, M. Li, Y. Zhang, Z. Chen, J. Wu, F. Gu, Y. Lu, and Y. Cheung, “PRO-VPT: Distribution-adaptive visual prompt tuning via prompt relocation,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., Hawaii, USA, Oct. 2025, pp. 1558–1568.
[12] Q. Wang, B. Zhao, Z. Zhang, Q. Gao, and L. Jiao, “Dual self-supervised deep graph clustering,” IEEE Trans. Multi., pp. 1–10, Jan. 2026.
[13] G. Li, Z. Yu, K. Yang, J. Lv, and C. L. P. Chen, “Pseudo-label similarity graph-driven multi-view contrastive clustering,” IEEE Trans. Multi., pp. 1–13, Feb. 2026.
[14] B. Jiang, C. Zhang, X. Liang, P. Zhou, J. Yang, X. Wu, J. Guan, W. Ding, and W. Sheng, “Collaborative similarity fusion and consistency recovery for incomplete multi-view clustering,” in Proc. AAAI Conf. Artif. Intell., vol. 39, no. 17, Philadelphia, Pennsylvania, USA, Feb. 2025, pp. 17 617–17 625.
[15] J. Chen, H. Mao, W. L. Woo, C. Liu, Z. Wang, and X. Peng, “One-step adaptive graph learning for incomplete multiview subspace clustering,” IEEE Trans. Knowl. Data Eng., vol. 37, no. 5, pp. 2771–2783, May 2025.
[16] M. Zhang, X. Liu, T. Han, X. Qu, and S. Niu, “Adaptive anchor-guided representation learning for efficient multi-view subspace clustering,” IEEE Trans. Image Process., vol. 34, pp. 6053–6067, Sept. 2025.
[17] L. Fei, J. He, Q. Zhu, S. Zhao, J. Wen, and Y. Xu, “Deep multi-view contrastive clustering via graph structure awareness,” IEEE Trans. Image Process., vol. 34, pp. 3805–3816, Jun. 2025.
[18] B. Deng, T. Wang, L. Fu, S. Huang, C. Chen, and T. Zhang, “THESAURUS: Contrastive graph clustering by swapping fused Gromov-Wasserstein couplings,” in Proc. AAAI Conf. Artif. Intell., Philadelphia, Pennsylvania, USA, Feb. 2025, pp. 16 199–16 207.
[19] C. Liu, X. Zhang, H. Zhao, Z. Liu, X. Xi, and L. Yu, “LMCBert: An automatic academic paper rating model based on large language models and contrastive learning,” IEEE Trans. Cybern., vol. 55, no. 6, pp. 2970–2979, 2025.
[20] E. Elhamifar and R. Vidal, “Sparse subspace clustering: Algorithm, theory, and applications,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 11, pp. 2765–2781, 2013.
[21] B. Peters, V. Niculae, and A. F. T. Martins, “Sparse sequence-to-sequence models,” in Proc. 57th Annu. Meet. Assoc. Comput. Linguist., Florence, Italy, Jul. 2019, pp. 1504–1519.
[22] A. F. T. Martins and R. F. Astudillo, “From softmax to sparsemax: A sparse model of attention and multi-label classification,” in Proc. 33rd Int. Conf. Mach. Learn., New York, USA, Jun. 2016, pp. 1614–1623.
[23] J. Chen, H. Mao, Y. Gou, and X. Peng, “Hierarchical sparse representation clustering for high-dimensional data streams,” IEEE Trans. Neural Netw. Learn. Syst., vol. 36, no. 10, pp. 18 035–18 047, Oct. 2025.
[24] S. Huang, Y. Zhang, L. Fu, and S. Wang, “Learnable multi-view matrix factorization with graph embedding and flexible loss,” IEEE Trans. Multi., vol. 25, pp. 3259–3272, 2022.
[25] L. Fu, Z. Chen, Y. Chen, and S. Wang, “Unified low-rank tensor learning and spectral embedding for multi-view subspace clustering,” IEEE Trans. Multi., vol. 25, pp. 4972–4985, 2022.
[26] J. Wen, J. Long, X. Lu, C. Liu, X. Fang, and Y. Xu, “Partial multiview incomplete multilabel learning via uncertainty-driven reliable dynamic fusion,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 48, no. 1, pp. 236–250, Jan. 2026.
[27] J. Yang, C. Y. Lu, Z. Wang, H. T. Chen, G. K. Xu, C. Zhang, S. Dong, X. Liang, and B. Jiang, “Multi-view clustering with granularity-aware pseudo supervision,” in Proc. AAAI Conf. Artif. Intell., vol. 40, no. 19, Singapore, Jan. 2026, pp. 27 538–27 546.
[28] H. Xu, X. Su, S. Chen, G. Chen, and X. Chen, “Bridging optimization and neural networks for efficient multi-view clustering,” in Proc. AAAI Conf. Artif. Intell., vol. 40, no. 19, Singapore, Jan. 2026, pp. 16 066–16 074.
[29] C. Liu, J. Guo, X. Zhang, D. Wu, and L. Yu, “Expert credibility prediction model based on fuzzy C-means clustering and similarity association,” IEEE Trans. Cybern., vol. 33, no. 8, pp. 2719–2729, 2025.
[30] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, et al., “PyTorch: An imperative style, high-performance deep learning library,” in Proc. 33rd Adv. Neural Inf. Process. Syst., Vancouver, BC, Canada, Dec. 2019, pp. 8026–8037.
[31]

LaFTer: Label–free tuning of zero- shot classifier using language and unlabeled image collections,

M. J. Mirza, L. Karlinsky, W. Lin, H. Possegger, M. K. R. Feris, and H. Bischof, “LaFTer: Label–free tuning of zero- shot classifier using language and unlabeled image collections,” in Proc. 37th Adv. Neural Inf. Process. Syst., New Orleans, Louisiana, USA, Dec. 2023, pp. 5765–5777

work page 2023
[32]

Sim’eoni, H

O. Sim’eoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, F. Massa, D. Haziza, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sentana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Couprie, J. Mairal, H. J’egou, P. Labatut, and P. Bojanowski, “DINOv3,” pp. 1–67, 2025

[33] M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. Alabdulmohsin, N. Parthasarathy, T. Evans, L. Beyer, Y. Xia, B. Mustafa, O. Hénaff, J. Harmsen, A. Steiner, and X. Zhai, “SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features,” pp. 1–20, 2025.

[34] D. P. Kingma and J. L. Ba, “ADAM: a method for stochastic optimization,” in Proc. 3rd Int. Conf. Learn. Represent., San Diego, CA, USA, May 2015, pp. 1–15.

[35] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, Mar. 2011.

[36] S. Kornblith, M. Norouzi, H. Lee, and G. Hinton, “Similarity of neural network representations revisited,” in Proc. 36th Int. Conf. Mach. Learn., California, USA, Jun. 2019, pp. 3519–3529.

[37] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” J. Mach. Learn. Res., vol. 9, no. 11, pp. 2579–2605, 2008.

Algorithm 1: Optimization Procedure for SAGL
Input: An unlabeled training dataset $D_{tr} = \{x_1, x_2, \ldots, x_n\}$ and a testing dataset $D_{ts} = \{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_{\hat{n}}\}$ belonging to $C$ categories, and $L$ pre-trained models.
Parameters: The numbe...

Parameter Settings: During both training and testing, the learning rate for the proposed SAGL model is empirically set to $5 \times 10^{-4}$ for the KITTI, Flowers, Food101 and ImageNet-1K datasets and to $1 \times 10^{-3}$ for all other datasets. The batch size for both training and testing is selected from the set {100, 500, 1,000, 5,000, 10,000}. Specifically, the batch s...

Measuring Similarity Across Heterogeneous Views: Different backbones exhibit different representation levels: transformer-based models (e.g., DINOv3, SigLIP 2 and CLIP ViT-L/14) typically produce global semantic representations, while convolutional models (e.g., ConvNeXt V2) capture more localized spatial features. We adopt centered kernel alignment (CKA) [36] to measure the similarity between feature distributions produced by different pretrained model pairs. ...
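As a rough illustration of the CKA measurement mentioned above, the following minimal sketch computes the linear variant of CKA between two feature matrices. The choice of the linear kernel, the function name `linear_cka`, and the random stand-in features are all assumptions made for illustration; the excerpt does not state which kernel or implementation the authors use.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two feature matrices that share the same rows (samples).

    X has shape (n, d1) and Y has shape (n, d2); the feature widths may differ.
    """
    # Center each feature dimension so the Gram matrices are centered.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)

# Hypothetical stand-ins for features from different pretrained backbones.
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(200, 64))
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
feats_b = feats_a @ Q                      # orthogonally rotated copy of feats_a
feats_c = rng.normal(size=(200, 32))       # unrelated features, different width

cka_same = linear_cka(feats_a, feats_b)    # close to 1: invariant to rotation
cka_diff = linear_cka(feats_a, feats_c)    # noticeably lower for unrelated features
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, which is why it can compare backbones whose embedding spaces have different dimensions and orientations.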

Comparison of Training Times for Self-Supervised Learning: To evaluate the training efficiency of the proposed SAGL method, we compare the computational costs of TURTLE, MSRL, and SAGL on the training sets of all eight datasets. For fair comparison, we report the computational cost of the competing methods that utilize two pretrained backbones. Fig. 2 sh...

Visualizations: To evaluate the learned representations, we employ t-SNE [37] to visualize three levels of features on three representative datasets of varying scales: Pets, Caltech101, and Food101. Specifically, the three levels are: (1) the original features extracted from the two pretrained backbones, (2) the projected features after the linear tran...

Sparsity Analysis on Sparse Attention Graphs: We first analyze the sparsity of the learned attention graphs during training. The sparsity ratio (SR) is defined as the number of nonzero elements in $A^{(l)}$ divided by the total number of elements. Specifically, we examine the sparsity ratios of sparse attention graphs on the two representative datasets, Caltec...
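The sparsity ratio defined above can be sketched in a few lines. The helper name `sparsity_ratio`, the optional tolerance `tol`, and the toy 4 × 4 matrix are assumptions for illustration only; they are not taken from the paper.

```python
import numpy as np

def sparsity_ratio(A, tol=0.0):
    """Fraction of entries of A whose magnitude exceeds tol (nonzeros / total)."""
    A = np.asarray(A)
    return np.count_nonzero(np.abs(A) > tol) / A.size

# Toy row-stochastic attention graph with exact zeros, as a sparse projection
# onto the simplex might produce (values are made up).
A = np.array([
    [0.0, 0.7, 0.3, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 1.0],
    [0.2, 0.8, 0.0, 0.0],
])

sr = sparsity_ratio(A)  # 7 nonzero entries out of 16
```

Under this definition a smaller SR means a sparser graph, since SR counts the entries that remain nonzero.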

Parameter Sensitivity Analysis: By exploiting the sparse self-representation property of features, each representation is constructed as a sparse linear combination of spatially proximate neighbors. Consequently, the batch size plays an important role in determining the quality of the learned sparse attention graphs. To investigate the sensitivity of SA...

Convergence Analysis: We empirically evaluate the convergence property of the proposed method across all eight datasets. Fig. 17 shows the convergence curves of Algorithm 1, where the x-axis corresponds to iterations, and the y-axis represents the objective loss defined in Eq. (21). A positive constant is added to the y-axis values for better readability...

