
arxiv: 2606.08126 · v1 · pith:T7XMC2I7new · submitted 2026-06-06 · 💻 cs.CV

One Stone, Three Birds: Self-adaptive Optimal Transport for Multi-VLM Selection, Adaptation, and Ensembling

Pith reviewed 2026-06-27 19:55 UTC · model grok-4.3

classification 💻 cs.CV
keywords optimal transport · vision-language models · model selection · domain adaptation · model ensembling · unlabeled target data · training-free framework

The pith

A single consensus transport plan from multiple VLMs solves selection, adaptation, and ensembling without labels or training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that choosing which VLM to trust, adapting it to an unlabeled target domain, and combining several VLMs all reduce to recovering one underlying sample-to-class assignment structure. Different models may disagree on specific images, yet their collective predictions still supply complementary information sufficient to estimate this structure. The method computes a self-adaptive optimal transport plan once from the frozen VLMs and re-uses that plan for ranking models by reliability, fitting transport-guided classifiers, and performing reliability-weighted ensembling.

Core claim

The central claim is that a trustworthy sample-class structure latent in the target set can be recovered by a self-adaptive optimal transport plan computed from the outputs of several frozen candidate VLMs, and that this single plan is sufficient to perform model selection by ranking combined semantic and visual reliability, target adaptation by fitting transport-conditioned visual classifiers, and ensembling by reliability-aware probabilistic integration.

What carries the argument

Self-adaptive optimal transport plan that estimates a consensus sample-to-class assignment from multiple VLM predictions without parameter updates.
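
As a rough illustration of what a consensus sample-to-class plan can look like, the sketch below averages the class probabilities of several frozen models, turns them into a cost matrix, and runs plain Sinkhorn iterations. This is generic entropic optimal transport, not the paper's self-adaptive formulation, which the abstract does not spell out. The function names, the averaging step, the uniform class marginal, and the default `epsilon` are all assumptions made for illustration.

```python
import numpy as np

def sinkhorn_plan(cost, row_marginal, col_marginal, epsilon=0.1, n_iters=200):
    """Entropic OT via Sinkhorn scaling; returns a nonnegative (N, K) plan."""
    kernel = np.exp(-cost / epsilon)
    u = np.ones_like(row_marginal)
    v = np.ones_like(col_marginal)
    for _ in range(n_iters):
        u = row_marginal / (kernel @ v)      # rescale rows toward row_marginal
        v = col_marginal / (kernel.T @ u)    # rescale columns toward col_marginal
    return u[:, None] * kernel * v[None, :]

def consensus_plan(probs_per_model, epsilon=0.1, n_iters=200):
    """probs_per_model: array of shape (M, N, K) with per-model class probabilities."""
    # ASSUMPTION: a simple average stands in for the paper's consensus mechanism.
    mean_probs = probs_per_model.mean(axis=0)
    cost = -np.log(mean_probs + 1e-12)
    n_samples, n_classes = mean_probs.shape
    rows = np.full(n_samples, 1.0 / n_samples)
    # ASSUMPTION: classes are balanced in the target set (uniform column marginal).
    cols = np.full(n_classes, 1.0 / n_classes)
    return sinkhorn_plan(cost, rows, cols, epsilon, n_iters)
```

Taking `plan.argmax(axis=1)` would give one pseudo-label per target sample. The uniform class marginal is the strongest assumption here; a self-adaptive scheme would presumably relax it.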

If this is right

  • Model selection reduces to ranking the reliability scores induced by the single transport plan.
  • Target adaptation is obtained by fitting visual classifiers conditioned on the transport assignments.
  • Ensembling is performed by reliability-weighted probabilistic combination of the models.
  • All three tasks reuse one plan computed from the outputs of the frozen VLMs, with no updates to VLM parameters.
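
Two of the uses listed above, selection and ensembling, can be sketched once a plan is available. The scoring rule here (negative cross-entropy between the row-normalised plan and each model's predictions) and the softmax weighting are assumptions chosen for simplicity; the paper's combined semantic and visual reliability measure is not given in the abstract, and the adaptation step is omitted.

```python
import numpy as np

def reliability_scores(probs_per_model, plan):
    """Higher score = model predictions agree more with the plan's assignments."""
    # Row-normalise the plan into per-sample class weights.
    targets = plan / plan.sum(axis=1, keepdims=True)
    log_probs = np.log(probs_per_model + 1e-12)           # (M, N, K)
    # ASSUMPTION: negative cross-entropy is used as a stand-in reliability score.
    return (targets[None] * log_probs).sum(axis=2).mean(axis=1)

def rank_models(scores):
    """Model indices ordered from most to least reliable."""
    return np.argsort(-scores)

def weighted_ensemble(probs_per_model, scores, temperature=1.0):
    """Combine model probabilities with softmax weights over reliability scores."""
    weights = np.exp((scores - scores.max()) / temperature)
    weights /= weights.sum()
    return np.tensordot(weights, probs_per_model, axes=1)  # (N, K)
```

The `temperature` parameter is hypothetical: small values push the ensemble toward the single top-ranked model, large values toward a plain average.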

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same transport structure could be reused for other downstream tasks that also require sample-to-class assignments, such as active learning or pseudo-label refinement.
  • The method implies that optimal transport may act as a general-purpose consensus layer whenever multiple models observe the same unlabeled data.
  • Testing the framework on candidate pools that include both general and highly specialized VLMs would reveal how much domain diversity is needed for the complementary-evidence assumption to hold.

Load-bearing premise

Outputs from different VLMs supply complementary evidence for the underlying sample-class structure even when their individual predictions conflict on the unlabeled target set.

What would settle it

Run the method on a labeled cross-domain benchmark; if the consensus plan produces worse model ranking accuracy or lower adaptation accuracy than the best single VLM or than simple averaging, the claim is falsified.
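
The check described above could be scripted roughly as follows, assuming labels are available for evaluation only. The helper names and the report layout are hypothetical, and the rank correlation is a simple Spearman estimate that breaks ties arbitrarily.

```python
import numpy as np

def accuracy(probs, labels):
    """Top-1 accuracy of an (N, K) probability matrix against integer labels."""
    return float((probs.argmax(axis=1) == labels).mean())

def spearman(a, b):
    """Spearman rank correlation; ties are broken arbitrarily, not averaged."""
    rank_a = np.argsort(np.argsort(a)).astype(float)
    rank_b = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

def falsification_report(probs_per_model, predicted_scores, consensus_probs, labels):
    """Compare label-free ranking and consensus predictions against labeled baselines."""
    oracle_acc = np.array([accuracy(p, labels) for p in probs_per_model])
    return {
        "rank_correlation": spearman(predicted_scores, oracle_acc),
        "best_single_acc": float(oracle_acc.max()),
        "mean_ensemble_acc": accuracy(probs_per_model.mean(axis=0), labels),
        "consensus_acc": accuracy(consensus_probs, labels),
    }
```

Under the criterion stated above, a `consensus_acc` below both `best_single_acc` and `mean_ensemble_acc`, or a low `rank_correlation`, would count against the claim.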

Figures

Figures reproduced from arXiv: 2606.08126 by Huafeng Li, Qiyu Xu, Quanxue Gao, Xiangyong Cao, Yonghang Tai, Yu Duan, Zhanxuan Hu.

Figure 1: Evolution of the VLM ecosystem across natural-image, remote ...
Figure 2: Model-performance fingerprints across natural-image, remote-sensing, and medical-pathology benchmarks. Rows denote datasets and columns denote ...
Figure 3: Framework of OSTB. Starting from an unlabeled target adaptation set and semantic priors, OSTB evaluates a heterogeneous pool of candidate VLMs ...
Figure 4: Predicted model ranking versus oracle ranking. Each point denotes a candidate VLM on one benchmark, with the horizontal axis showing the oracle ...
Figure 5: Visualization of transport-induced GMM adaptation on representative ...
Figure 6: Parameter analysis of the entropic regularization coefficient ...
Figure 7: Iteration-number analysis of OSTB. The horizontal axis denotes the ...
Original abstract

Vision-language models (VLMs) enable visual recognition from semantic class descriptions, which makes them attractive when target annotations are scarce or unavailable. Most deployment pipelines, however, first choose a single VLM and then adapt that model to the unlabeled target set. This single-backbone paradigm hides a critical assumption: the selected VLM is already compatible with the target domain. In realistic cross-domain deployment, several general-purpose and domain-specialized VLMs may be plausible, yet no instance-level target labels are available to identify the reliable ones. Deployment therefore requires a coupled solution for model selection, target adaptation, and prediction integration. We revisit this problem from a system-level multi-VLM perspective. Our central observation is that the three decisions above depend on the same latent object: a trustworthy sample-class structure in the target set. Different VLMs may encode different transfer biases and produce conflicting predictions, but their outputs can still provide complementary evidence for estimating this structure. We propose One Stone, Three Birds, a training-free framework based on self-adaptive optimal transport. Given a pool of frozen candidate VLMs, OSTB estimates a consensus sample-to-class transport plan without updating VLM parameters. The learned transport structure is then reused for all deployment objectives: model selection is performed by ranking the combined semantic and visual reliability induced by the consensus plan; target adaptation is obtained by fitting transport-conditioned visual classifiers; and ensembling is implemented through reliability-aware probabilistic integration. Extensive experiments on natural-image, remote-sensing, and medical-pathology benchmarks show that OSTB improves model ranking, adaptation stability, and ensemble robustness under heterogeneous candidate pools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes 'One Stone, Three Birds' (OSTB), a training-free framework using self-adaptive optimal transport to estimate a consensus sample-to-class transport plan from multiple frozen VLMs' outputs on an unlabeled target set. This plan is then reused for model selection (ranking by semantic and visual reliability), target adaptation (fitting transport-conditioned visual classifiers), and ensembling (reliability-aware probabilistic integration). Experiments on natural-image, remote-sensing, and medical-pathology benchmarks demonstrate improvements in model ranking, adaptation stability, and ensemble robustness.

Significance. If the central claim holds, the work offers a practical, unified solution for deploying multiple VLMs in label-scarce cross-domain settings without requiring target annotations or parameter updates. The training-free aspect, the shared latent structure premise, and the empirical results across three domains are strengths that could influence multi-model deployment strategies in computer vision.

major comments (2)
  1. [Abstract] (central observation paragraph): the premise that 'different VLMs may encode different transfer biases and produce conflicting predictions, but their outputs can still provide complementary evidence for estimating this structure' is load-bearing for all three tasks, yet no concrete test, ablation, or failure case is described to bound when complementarity holds under conflict.
  2. [Abstract] (method description): the transport plan is presented as simultaneously defining reliability for selection, adaptation, and ensembling while remaining 'self-adaptive' and training-free; without the explicit estimation equations it is unclear whether the plan is recovered from VLM outputs alone or implicitly depends on a fitted quantity derived from the same outputs.
minor comments (1)
  1. The abstract states 'extensive experiments' but provides no details on the number of candidate VLMs, specific datasets, or comparison baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] (central observation paragraph): the premise that 'different VLMs may encode different transfer biases and produce conflicting predictions, but their outputs can still provide complementary evidence for estimating this structure' is load-bearing for all three tasks, yet no concrete test, ablation, or failure case is described to bound when complementarity holds under conflict.

    Authors: We agree that the complementarity premise is central to the framework and that the current manuscript does not sufficiently bound the conditions under which it holds. In the revised version we will add a new subsection (in the experiments or analysis section) containing (i) controlled ablations that systematically vary the level of prediction conflict among the VLMs (e.g., by selecting subsets with increasing disagreement or by injecting calibrated label noise) and (ii) a failure-case study that identifies regimes where the consensus transport plan ceases to improve over single-VLM baselines. These additions will make the scope and limitations of the central observation explicit. revision: yes
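
A conflict-controlled ablation of the kind proposed in this response needs some measure of how much the candidate models disagree. One simple option, offered here as an assumption rather than as the authors' protocol, is the mean pairwise disagreement rate of top-1 predictions, which can then be used to order model subsets from low to high conflict.

```python
import numpy as np
from itertools import combinations

def pairwise_disagreement(probs_per_model):
    """Mean fraction of samples on which two models' top-1 predictions differ.

    probs_per_model: array of shape (M, N, K) with M >= 2.
    """
    preds = probs_per_model.argmax(axis=2)                 # (M, N)
    n_models = preds.shape[0]
    rates = [(preds[i] != preds[j]).mean()
             for i, j in combinations(range(n_models), 2)]
    return float(np.mean(rates))

def subsets_by_conflict(probs_per_model, subset_size):
    """List (disagreement, model_indices) pairs sorted from least to most conflict.

    subset_size must be at least 2 so that each subset contains a model pair.
    """
    n_models = probs_per_model.shape[0]
    scored = [(pairwise_disagreement(probs_per_model[list(idx)]), idx)
              for idx in combinations(range(n_models), subset_size)]
    return sorted(scored)
```

Running the consensus procedure on subsets drawn from along this ordering would show at what level of disagreement it stops improving on single-model baselines.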

  2. Referee: [Abstract] (method description): the transport plan is presented as simultaneously defining reliability for selection, adaptation, and ensembling while remaining 'self-adaptive' and training-free; without the explicit estimation equations it is unclear whether the plan is recovered from VLM outputs alone or implicitly depends on a fitted quantity derived from the same outputs.

    Authors: We acknowledge that the abstract does not contain the estimation equations, which can leave the precise origin of the transport plan ambiguous. The plan is recovered exclusively from the frozen VLM output distributions on the unlabeled target set via a self-adaptive optimal-transport procedure whose only variables are the transport couplings themselves; no additional parameters are fitted. In the revision we will (i) insert a concise reference to the key estimation equation in the abstract and (ii) expand the method section to present the full self-adaptive OT formulation with explicit notation showing that all quantities are derived directly from the VLM logits. revision: yes
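
For orientation, the standard entropic optimal transport problem that such a formulation would presumably build on is written out below. This is the textbook form, not the paper's self-adaptive objective, which the abstract does not state. The symbols are illustrative assumptions: $C$ is an $N \times K$ sample-to-class cost matrix, $a$ and $b$ are the sample and class marginals, and $\varepsilon$ is the entropic regularization coefficient.

```latex
\begin{aligned}
T^{\star} &= \operatorname*{arg\,min}_{T \in U(a,b)} \; \langle T, C \rangle - \varepsilon H(T), \\
U(a,b) &= \{\, T \in \mathbb{R}_{\ge 0}^{N \times K} : T \mathbf{1}_K = a,\; T^{\top} \mathbf{1}_N = b \,\}, \\
H(T) &= -\sum_{i=1}^{N} \sum_{k=1}^{K} T_{ik} \left( \log T_{ik} - 1 \right).
\end{aligned}
```

The solution has the form $T^{\star} = \operatorname{diag}(u)\, \exp(-C/\varepsilon)\, \operatorname{diag}(v)$, with the scaling vectors $u$ and $v$ found by Sinkhorn iterations. Whether the marginals or the cost are themselves re-estimated in the self-adaptive variant is exactly what the referee asks the authors to make explicit.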

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained from stated premise

full rationale

The paper's central claim follows directly from the explicit observation that model selection, adaptation, and ensembling share a latent sample-class structure recoverable from complementary VLM outputs via self-adaptive OT. The transport plan is estimated from frozen VLM predictions on the target set and then applied to the three tasks; this reuse is a logical consequence of the premise rather than a definitional loop or fitted quantity renamed as prediction. No equations, self-citations, uniqueness theorems, or ansatzes are shown reducing the result to its inputs by construction. The framework is described as training-free with empirical support across benchmarks, making the derivation independent of the target quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that conflicting VLM outputs still supply complementary evidence for a latent sample-class structure. No free parameters or invented entities are identifiable from the abstract alone.

axioms (1)
  • domain assumption: Different VLMs may encode different transfer biases and produce conflicting predictions, but their outputs can still provide complementary evidence for estimating this structure.
    Stated explicitly as the central observation in the abstract.

pith-pipeline@v0.9.1-grok · 5848 in / 1415 out tokens · 19470 ms · 2026-06-27T19:55:45.478532+00:00 · methodology


Reference graph

Works this paper leans on

92 extracted references · 8 linked inside Pith

  1. [1]

    Learning transferable visual models from natural language supervi- sion,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervi- sion,” inProceedings of the 38th International Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 139, 2021, ...

  2. [2]

    SigLIP 2: Multilingual vision- language encoders with improved semantic understanding, localization, and dense features,

    M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. Alabdulmohsin, N. Parthasarathy, T. Evans, L. Beyer, Y . Xia, B. Mustafa, O. H ´enaff, J. Harmsen, A. Steiner, and X. Zhai, “SigLIP 2: Multilingual vision- language encoders with improved semantic understanding, localization, and dense features,”arXiv preprint arXiv:2502.14786, 2025

  3. [3]

    Reproducible scaling laws for contrastive language-image learning,

    M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, and J. Jitsev, “Reproducible scaling laws for contrastive language-image learning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 2818–2829

  4. [4]

    EV A-CLIP: Improved training techniques for CLIP at scale,

    Q. Sun, Y . Fang, L. Wu, X. Wang, and Y . Cao, “EV A-CLIP: Improved training techniques for CLIP at scale,”arXiv preprint arXiv:2303.15389, 2023

  5. [5]

    Demystifying CLIP data,

    H. Xu, S. Xie, X. Tan, P.-Y . Huang, R. Howes, V . Sharma, S.-W. Li, G. Ghosh, L. Zettlemoyer, and C. Feichtenhofer, “Demystifying CLIP data,” inThe Twelfth International Conference on Learning Representations (ICLR), 2024

  6. [6]

    RemoteCLIP: A vision language foundation model for remote sensing,

    F. Liu, D. Chen, Z. Guan, X. Zhou, J. Zhu, Q. Ye, L. Fu, and J. Zhou, “RemoteCLIP: A vision language foundation model for remote sensing,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1– 16, 2024

  7. [7]

    RS5M and GeoRSCLIP: A large scale vision-language dataset and a large vision-language model for remote sensing,

    Z. Zhang, T. Zhao, Y . Guo, and J. Yin, “RS5M and GeoRSCLIP: A large scale vision-language dataset and a large vision-language model for remote sensing,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–23, 2024

  8. [8]

    A multimodal biomedical foundation model trained from fifteen million image-text pairs,

    S. Zhang, Y . Xu, N. Usuyama, H. Xu, J. Bagga, R. Tinn, S. Preston, R. Rao, M. Wei, N. Valluri, C. Wong, A. Tupini, Y . Wang, M. Mazzola, S. Shukla, L. Liden, J. Gao, A. Crabtree, B. Piening, C. Bifulco, M. P. Lungren, T. Naumann, S. Wang, and H. Poon, “A multimodal biomedical foundation model trained from fifteen million image-text pairs,”NEJM AI, vol. 2...

  9. [9]

    A visual-language foundation model for pathology image analysis using medical twitter,

    Z. Huang, F. Bianchi, M. Yuksekgonul, T. J. Montine, and J. Zou, “A visual-language foundation model for pathology image analysis using medical twitter,”Nature Medicine, vol. 29, no. 9, pp. 2307–2316, 2023

  10. [10]

    A visual-language foundation model for computational pathology,

    M. Y . Lu, B. Chen, D. F. K. Williamson, R. J. Chen, I. Liang, T. Ding, G. Jaume, I. Odintsov, L. P. Le, G. Gerber, A. V . Parwani, A. Zhang, and F. Mahmood, “A visual-language foundation model for computational pathology,”Nature Medicine, vol. 30, no. 3, pp. 863–874, 2024

  11. [11]

    Learning to prompt for vision- language models,

    K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision- language models,”International Journal of Computer Vision, vol. 130, no. 9, pp. 2337–2348, 2022

  12. [12]

    Conditional prompt learning for vision-language models,

    K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Conditional prompt learning for vision-language models,” inProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 16 816–16 825

  13. [13]

    LaFTer: Label-free tuning of zero-shot classifier using language and unlabeled image collections,

    M. J. Mirza, L. Karlinsky, W. Lin, M. Kozinski, H. Possegger, R. Feris, and H. Bischof, “LaFTer: Label-free tuning of zero-shot classifier using language and unlabeled image collections,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023, pp. 5765– 5777

  14. [14]

    Test-time prompt tuning for zero-shot generalization in vision-language models,

    M. Shu, W. Nie, D.-A. Huang, Z. Yu, T. Goldstein, A. Anandkumar, and C. Xiao, “Test-time prompt tuning for zero-shot generalization in vision-language models,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022, pp. 14 274–14 289

  15. [15]

    Transductive zero-shot and few-shot CLIP,

    S. Martin, Y . Huang, F. Shakeri, J.-C. Pesquet, and I. Ben Ayed, “Transductive zero-shot and few-shot CLIP,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 28 816–28 826

  16. [16]

    Frustratingly easy test-time adaptation of vision-language models,

    M. Farina, G. Franchi, G. Iacca, M. Mancini, and E. Ricci, “Frustratingly easy test-time adaptation of vision-language models,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 37, 2024, pp. 129 062–129 093

  17. [17]

    Efficient test-time adaptation of vision-language models,

    A. Karmanov, D. Guan, S. Lu, A. El Saddik, and E. Xing, “Efficient test-time adaptation of vision-language models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 14 162–14 171

  18. [18]

    Computational optimal transport: With appli- cations to data science,

    G. Peyr ´e and M. Cuturi, “Computational optimal transport: With appli- cations to data science,”Foundations and Trends® in Machine Learning, vol. 11, no. 5–6, pp. 355–607, 2019

  19. [19]

    Sota: Self-adaptive optimal transport for zero-shot classification with multiple foundation models,

    Z. Hu, Q. Xu, Y . Duan, Y . Tai, and H. Li, “Sota: Self-adaptive optimal transport for zero-shot classification with multiple foundation models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2026, pp. 26 624–26 634

  20. [20]

    Scaling up visual and vision-language representation learning with noisy text supervision,

    C. Jia, Y . Yang, Y . Xia, Y .-T. Chen, Z. Parekh, H. Pham, Q. V . Le, Y .- H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” inProceedings of the 38th International Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 139, 2021, pp. 4904– 4916

  21. [21]

    FLA V A: A foundational language and vision alignment model,

    A. Singh, R. Hu, V . Goswami, G. Couairon, W. Galuba, M. Rohrbach, and D. Kiela, “FLA V A: A foundational language and vision alignment model,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 15 617–15 629

  22. [22]

    CoCa: Contrastive captioners are image-text foundation mod- els,

    J. Yu, Z. Wang, V . Vasudevan, L. Yeung, M. Seyedhosseini, and Y . Wu, “CoCa: Contrastive captioners are image-text foundation mod- els,”Transactions on Machine Learning Research, 2022

  23. [23]

    Jina-CLIP- v2: Multilingual multimodal embeddings for text and images,

    A. Koukounas, G. Mastrapas, S. Eslami, B. Wang, M. K. Akram, M. G ¨unther, I. Mohr, S. Sturua, N. Wang, and H. Xiao, “Jina-CLIP- v2: Multilingual multimodal embeddings for text and images,”arXiv preprint arXiv:2412.08802, 2024. 17

  24. [24]

    Perception encoder: The best visual embeddings are not at the output of the network,

    D. Bolya, P.-Y . Huang, P. Sun, J. H. Cho, A. Madotto, C. Wei, T. Ma, J. Zhi, J. Rajasegaran, H. Rasheed, J. Wang, M. Monteiro, H. Xu, S. Dong, N. Ravi, D. Li, P. Doll ´ar, and C. Feichtenhofer, “Perception encoder: The best visual embeddings are not at the output of the network,”arXiv preprint arXiv:2504.13181, 2025

  25. [25]

    UniMed-CLIP: Towards a unified image-text pretraining paradigm for diverse medical imaging modalities,

    M. U. Khattak, S. Kunhimon, M. Naseer, S. Khan, and F. S. Khan, “UniMed-CLIP: Towards a unified image-text pretraining paradigm for diverse medical imaging modalities,”arXiv preprint arXiv:2412.10372, 2024

  26. [26]

    Knowledge-enhanced pretraining for vision-language pathology foundation model on cancer diagnosis,

    X. Zhou, L. Sun, D. He, W. Guan, G. Wang, R. Wang, L. Wang, X. Yuan, X. Sun, Y . Zhang, K. Sun, Y . Wang, and W. Xie, “Knowledge-enhanced pretraining for vision-language pathology foundation model on cancer diagnosis,”Cancer Cell, vol. 44, no. 4, pp. 777–791, 2026

  27. [27]

    PMC-CLIP: Contrastive language-image pre-training using biomedical documents,

    W. Lin, Z. Zhao, X. Zhang, C. Wu, Y . Zhang, Y . Wang, and W. Xie, “PMC-CLIP: Contrastive language-image pre-training using biomedical documents,” inMedical Image Computing and Computer Assisted In- tervention – MICCAI 2023, 2023, pp. 525–536

  28. [28]

    PathGen-1.6M: 1.6 million pathology image-text pairs generation through multi-agent collaboration,

    Y . Sun, Y . Zhang, Y . Si, C. Zhu, K. Zhang, Z. Shui, J. Li, X. Gong, X. Lyu, T. Lin, and L. Yang, “PathGen-1.6M: 1.6 million pathology image-text pairs generation through multi-agent collaboration,” inPro- ceedings of the International Conference on Learning Representations (ICLR), vol. 2025, 2025, pp. 94 611–94 653

  29. [29]

    A vision– language foundation model for precision oncology,

    J. Xiang, X. Wang, X. Zhang, Y . Xi, F. Eweje, Y . Chen, Y . Li, C. Bergstrom, M. Gopaulchan, T. Kim, K.-H. Yu, S. Willens, F. M. Olguin, J. J. Nirschl, J. Neal, M. Diehn, S. Yang, and R. Li, “A vision– language foundation model for precision oncology,”Nature, vol. 638, pp. 769–778, 2025

  30. [30]

    Quilt-1m: One million image-text pairs for histopathology,

    W. O. Ikezogwo, M. S. Seyfioglu, F. Ghezloo, D. S. C. Geva, F. S. Mohammed, P. K. Anand, R. Krishna, and L. Shapiro, “Quilt-1m: One million image-text pairs for histopathology,” inAdvances in Neural Information Processing Systems, vol. 36, 2023, pp. 37 995–38 017

  31. [31]

    SkyScript: A large and semantically diverse vision-language dataset for remote sens- ing,

    Z. Wang, R. Prabha, T. Huang, J. Wu, and R. Rajagopal, “SkyScript: A large and semantically diverse vision-language dataset for remote sens- ing,” inProceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 38, no. 6, 2024, pp. 5805–5813

  32. [32]

    Multilingual vision- language pre-training for the remote sensing domain,

    J. D. Silva, J. Magalh ˜aes, D. Tuia, and B. Martins, “Multilingual vision- language pre-training for the remote sensing domain,” inProceedings of the 32nd ACM International Conference on Advances in Geographic Information Systems (SIGSPATIAL), 2024, pp. 220–232

  33. [33]

    RSDiX: Lightweight and data-efficient VLMs for remote sensing through self-distillation,

    A. Terlizzi, A. Nazzaro, L. Bernardi, F. Bardozzo, and R. Tagliaferri, “RSDiX: Lightweight and data-efficient VLMs for remote sensing through self-distillation,” inProceedings of the International Joint Conference on Neural Networks (IJCNN), 2025, pp. 1–10

  34. [34]

    Learning generalized zero- shot learners for open-domain image geolocalization,

    L. Haas, S. Alberti, and M. Skreta, “Learning generalized zero- shot learners for open-domain image geolocalization,”arXiv preprint arXiv:2302.00275, 2023

  35. [35]

    COSMIC: Clique-oriented semantic multi-space integration for robust CLIP test-time adaptation,

    F. Huang, J. Jiang, Q. Jiang, H. Li, F. N. Khan, and Z. Wang, “COSMIC: Clique-oriented semantic multi-space integration for robust CLIP test-time adaptation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 9772– 9781

  36. [36]

    Dual prototype evolving for test-time generalization of vision-language models,

    C. Zhang, S. Stepputtis, K. Sycara, and Y . Xie, “Dual prototype evolving for test-time generalization of vision-language models,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 37, 2024, pp. 32 111–32 136

  37. [37]

    Dual memory networks: A versatile adaptation approach for vision-language models,

    Y . Zhang, W. Zhu, H. Tang, Z. Ma, K. Zhou, and L. Zhang, “Dual memory networks: A versatile adaptation approach for vision-language models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 28 718–28 728

  38. [38]

    DINOv2: Learning robust visual features without supervision,

    M. Oquab, T. Darcet, T. Moutakanni, H. V . V o, M. Szafraniec, V . Khali- dov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y . Huang, S.-W. Li, I. Misra, M. Rabbat, V . Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “DINOv2: Learning robust visual features withou...

  39. [39]

    Open- vocabulary panoptic segmentation with text-to-image diffusion models,

    J. Xu, S. Liu, A. Vahdat, W. Byeon, X. Wang, and S. De Mello, “Open- vocabulary panoptic segmentation with text-to-image diffusion models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 2955–2966

  40. [40]

    Segment everything everywhere all at once,

    X. Zou, J. Yang, H. Zhang, F. Li, L. Li, J. Wang, L. Wang, J. Gao, and Y . J. Lee, “Segment everything everywhere all at once,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023

  41. [41]

    Grounded SAM: Assembling open-world models for diverse visual tasks,

    T. Ren, S. Liu, A. Zeng, J. Lin, K. Li, H. Cao, J. Chen, X. Huang, Y . Chen, F. Yan, Z. Zeng, H. Zhang, F. Li, J. Yang, H. Li, Q. Jiang, and L. Zhang, “Grounded SAM: Assembling open-world models for diverse visual tasks,”arXiv preprint arXiv:2401.14159, 2024

  42. [42]

    SAM 3: Segment anything with concepts,

    N. Carion, L. Gustafson, Y .-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V . Alwalaet al., “SAM 3: Segment anything with concepts,”arXiv preprint arXiv:2511.16719, 2025

  43. [43]

    An information-theoretic approach to transferability in task transfer learning,

    Y . Bao, Y . Li, S.-L. Huang, L. Zhang, L. Zheng, A. R. Zamir, and L. J. Guibas, “An information-theoretic approach to transferability in task transfer learning,” in2019 IEEE International Conference on Image Processing (ICIP), 2019, pp. 2309–2313

  44. [44]

    LEEP: A new measure to evaluate transferability of learned representations,

    C. Nguyen, T. Hassner, M. Seeger, and C. Archambeau, “LEEP: A new measure to evaluate transferability of learned representations,” in Proceedings of the 37th International Conference on Machine Learning (ICML), vol. 119, 2020, pp. 7294–7305

  45. [45]

    LogME: Practical assessment of pre-trained models for transfer learning,

    K. You, Y . Liu, J. Wang, and M. Long, “LogME: Practical assessment of pre-trained models for transfer learning,” inProceedings of the 38th International Conference on Machine Learning (ICML), vol. 139, 2021, pp. 12 133–12 143

  46. [46]

    Frustratingly easy transferability estimation,

    L.-K. Huang, J. Huang, Y . Rong, Q. Yang, and Y . Wei, “Frustratingly easy transferability estimation,” inProceedings of the 39th International Conference on Machine Learning (ICML), vol. 162, 2022, pp. 9201– 9225

  47. [47]

    Scalable diverse model selec- tion for accessible transfer learning,

    D. Bolya, R. Mittapalli, and J. Hoffman, “Scalable diverse model selec- tion for accessible transfer learning,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 34, 2021, pp. 19 301–19 312

  48. [48]

    Transferability metrics for selecting source model ensembles,

    A. Agostinelli, J. Uijlings, T. Mensink, and V . Ferrari, “Transferability metrics for selecting source model ensembles,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 7936–7946

  49. [49]

    Building a winning team: Selecting source model ensembles using a submodular transferability estimation approach,

    V . K. B, S. Bachu, T. Garg, N. L. Narasimhan, R. Konuru, and V . N. Balasubramanian, “Building a winning team: Selecting source model ensembles using a submodular transferability estimation approach,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 11 609–11 620

  50. [50]

    How stable are transferability metrics evaluations?

    A. Agostinelli, M. P ´andy, J. Uijlings, T. Mensink, and V . Ferrari, “How stable are transferability metrics evaluations?” inProceedings of the European Conference on Computer Vision (ECCV), 2022, pp. 303–321

  51. [51]

    Rethinking model selection in VLM through the lens of Gromov-Wasserstein distance,

    M. Li, Y . Liu, J. Ma, E. Osborne, B. Han, and T. Liu, “Rethinking model selection in VLM through the lens of Gromov-Wasserstein distance,” arXiv preprint arXiv:2605.01325, 2026

  52. [52]

    Simple and scalable predictive uncertainty estimation using deep ensembles,

    B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable predictive uncertainty estimation using deep ensembles,” inAdvances in Neural Information Processing Systems, vol. 30, 2017

  53. [53]

    Model soups: Averaging weights of multiple fine- tuned models improves accuracy without increasing inference time,

    M. Wortsman, G. Ilharco, S. Y . Gadre, R. Roelofs, R. Gontijo Lopes, A. S. Morcos, H. Namkoong, A. Farhadi, Y . Carmon, S. Kornblith, and L. Schmidt, “Model soups: Averaging weights of multiple fine- tuned models improves accuracy without increasing inference time,” in Proceedings of the 39th International Conference on Machine Learning (ICML), vol. 162, 2022

  54. [54]

    Unified optimal transport framework for universal domain adaptation,

    W. Chang, Y . Shi, H. D. Tuan, and J. Wang, “Unified optimal transport framework for universal domain adaptation,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022, pp. 29 512– 29 524

  55. [55]

    PLOT: Prompt learning with optimal transport for vision-language models,

    G. Chen, W. Yao, X. Song, X. Li, Y. Rao, and K. Zhang, “PLOT: Prompt learning with optimal transport for vision-language models,” in The Eleventh International Conference on Learning Representations (ICLR), 2023

  56. [56]

    Recover and match: Open-vocabulary multi-label recognition through knowledge-constrained optimal transport,

    H. Tan, Z. Tan, J. Li, A. Liu, J. Wan, and Z. Lei, “Recover and match: Open-vocabulary multi-label recognition through knowledge-constrained optimal transport,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 4650–4660

  57. [57]

    AWT: Transferring vision-language models via augmentation, weighting, and transportation,

    Y. Zhu, Y. Ji, Z. Zhao, G. Wu, and L. Wang, “AWT: Transferring vision-language models via augmentation, weighting, and transportation,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 37, 2024, pp. 25561–25591

  58. [58]

    A tutorial on MM algorithms,

    D. R. Hunter and K. Lange, “A tutorial on MM algorithms,” The American Statistician, vol. 58, no. 1, pp. 30–37, 2004

  59. [59]

    ImageNet: A large-scale hierarchical image database,

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255

  60. [60]

    SUN database: Large-scale scene recognition from abbey to zoo,

    J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “SUN database: Large-scale scene recognition from abbey to zoo,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 3485–3492

  61. [61]

    Fine-grained visual classification of aircraft,

    S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi, “Fine-grained visual classification of aircraft,” arXiv preprint arXiv:1306.5151, 2013

  62. [62]

    EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification,

    P. Helber, B. Bischke, A. Dengel, and D. Borth, “EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 12, no. 7, pp. 2217–2226, 2019

  63. [63]

    3D object representations for fine-grained categorization,

    J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3D object representations for fine-grained categorization,” in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2013, pp. 554–561

  64. [64]

    Food-101: Mining discriminative components with random forests,

    L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101: Mining discriminative components with random forests,” in Computer Vision – ECCV 2014. Springer, 2014, pp. 446–461

  65. [65]

    Cats and dogs,

    O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar, “Cats and dogs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 3498–3505

  66. [66]

    Automated flower classification over a large number of classes,

    M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in Proceedings of the Sixth Indian Conference on Computer Vision, Graphics & Image Processing, 2008, pp. 722–729

  67. [67]

    Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,

    L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,” in Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2004, p. 178

  68. [68]

    Describing textures in the wild,

    M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, “Describing textures in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 3606–3613

  69. [69]

    UCF101: A dataset of 101 human actions classes from videos in the wild,

    K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012

  70. [70]

    Learning multiple layers of features from tiny images,

    A. Krizhevsky, “Learning multiple layers of features from tiny images,” University of Toronto, Tech. Rep., 2009

  71. [71]

    The caltech-UCSD birds-200-2011 dataset,

    C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The caltech-UCSD birds-200-2011 dataset,” in California Institute of Technology Technical Report CNS-TR-2011-001, 2011

  72. [72]

    Proportion constrained weakly supervised histopathology image classification,

    J. Silva-Rodríguez, A. Schmidt, M. A. Sales, R. Molina, and V. Naranjo, “Proportion constrained weakly supervised histopathology image classification,” Computers in Biology and Medicine, vol. 147, p. 105714, 2022

  73. [73]

    Rotation equivariant CNNs for digital pathology,

    B. S. Veeling, J. Linmans, J. Winkens, T. Cohen, and M. Welling, “Rotation equivariant CNNs for digital pathology,” in Medical Image Computing and Computer Assisted Intervention – MICCAI 2018. Springer, 2018, pp. 210–218

  74. [74]

    Osteosarcoma data from UT southwestern/UT dallas for viable and necrotic tumor assessment,

    P. Leavey, A. Sengupta, D. Rakheja, O. Daescu, H. B. Arunachalam, and R. Mishra, “Osteosarcoma data from UT southwestern/UT dallas for viable and necrotic tumor assessment,” The Cancer Imaging Archive, 2019

  75. [75]

    BACH: Grand challenge on breast cancer histology images,

    G. Aresta, T. Araújo, S. Kwok, S. S. Chennamsetty, M. Safwan, V. Alex, B. Marami, M. Prastawa, M. Chan, M. Donovan et al., “BACH: Grand challenge on breast cancer histology images,” Medical Image Analysis, vol. 56, pp. 122–139, 2019

  76. [76]

    A dataset for breast cancer histopathological image classification,

    F. A. Spanhol, L. S. Oliveira, C. Petitjean, and L. Heutte, “A dataset for breast cancer histopathological image classification,” IEEE Transactions on Biomedical Engineering, vol. 63, no. 7, pp. 1455–1462, 2016

  77. [77]

    Deep learning for the detection of anatomical tissue structures and neoplasms of the skin on scanned histopathological tissue sections,

    K. Kriegsmann, F. Löbers, C. Zgorzelski, J. Kriegsmann, C. Janssen, R. R. Meliß, T. Muley, U. Sack, G. Steinbuss, and M. Kriegsmann, “Deep learning for the detection of anatomical tissue structures and neoplasms of the skin on scanned histopathological tissue sections,” Frontiers in Oncology, vol. 12, p. 1022967, 2022

  78. [78]

    Lung and colon cancer histopathological image dataset (LC25000),

    A. A. Borkowski, M. M. Bui, L. B. Thomas, C. P. Wilson, L. A. DeLand, and S. M. Mastorides, “Lung and colon cancer histopathological image dataset (LC25000),” arXiv preprint arXiv:1912.12142, 2019

  79. [79]

    100,000 histological images of human colorectal cancer and healthy tissue,

    J. N. Kather, N. Halama, and A. Marx, “100,000 histological images of human colorectal cancer and healthy tissue,” Zenodo, 2018

  80. [80]

    Multi-layer pseudo-supervision for histopathology tissue semantic segmentation using patch-level classification labels,

    C. Han, J. Lin, J. Mai, Y. Wang, Q. Zhang, B. Zhao, X. Chen, X. Pan, Z. Shi, Z. Xu et al., “Multi-layer pseudo-supervision for histopathology tissue semantic segmentation using patch-level classification labels,” Medical Image Analysis, vol. 80, p. 102487, 2022

Showing first 80 references.