Concept Alignment Contrast and Long-Short Prompt Memory for Test-Time Adaptation of SAM3 in Medical Image Segmentation

Guotai Wang; Jianghao Wu; Ping Ye; Shaoting Zhang; Yubo Zhou

arxiv: 2606.22963 · v1 · pith:AAFY4YV6new · submitted 2026-06-22 · 💻 cs.CV

Yubo Zhou, Jianghao Wu, Ping Ye, Shaoting Zhang, Guotai Wang

Pith reviewed 2026-06-26 08:53 UTC · model grok-4.3

classification 💻 cs.CV

keywords test-time adaptation · SAM3 · medical image segmentation · concept alignment contrast · prompt memory · pseudo-label generation · vision-language models · domain adaptation

The pith

CM-TTA adapts SAM3 to medical images by selecting reliable augmentations through textual-visual consistency and using dual prompt memory for stable updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CM-TTA to counter the performance drop SAM3 suffers when moving from natural to medical images, a drop the abstract attributes to differing imaging principles and styles. It replaces image-level uncertainty minimization with a Concept Alignment Contrast (CAC) metric that scores augmented views by how well their predictions match textual concepts, then uses the highest-scoring view as supervision. A Long-Short Prompt Memory (LSPM) module combines recent high-scoring prompts, for fast local changes, with a persistent global prompt, for stability across test samples. Dense supervision from the resulting pseudo-labels then updates the prompt embeddings. Experiments on prostate and skin lesion segmentation are reported to exceed prior test-time adaptation approaches for SAM3.

Core claim

The central claim is that textual-visual semantic consistency, measured by the Concept Alignment Contrast metric, can identify high-quality region predictions among multiple augmentations of a test image. These selections feed a Long-Short Prompt Memory that fuses recent prompts dynamically while preserving a stable global prompt. Together they supply enhanced pseudo-labels that serve as dense supervision for one-pass prompt optimization, enabling SAM3 to adapt to medical domains without annotations.

What carries the argument

The Concept Alignment Contrast (CAC) metric, which uses textual-visual semantic consistency to rank prediction quality, and the Long-Short Prompt Memory (LSPM) module, which maintains both short-term dynamic prompts and a long-term stable prompt for pseudo-label generation.
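
To make the view-selection idea concrete, here is a minimal sketch under stated assumptions. The text above does not give the CAC formula, so the score used here (foreground-to-text cosine similarity minus background-to-text cosine similarity) is a hypothetical stand-in, and the names `cac_score` and `select_best_view` as well as the pooled `fg`/`bg` embeddings are illustrative, not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cac_score(fg_embedding, bg_embedding, text_embedding):
    """Hypothetical contrast score: how much better the predicted foreground
    matches the text concept than the predicted background does."""
    return cosine(fg_embedding, text_embedding) - cosine(bg_embedding, text_embedding)


def select_best_view(views, text_embedding):
    """Return (index, score) of the augmented view with the highest score.

    Each view is a dict holding pooled 'fg' and 'bg' embeddings (assumed inputs).
    """
    scores = [cac_score(v["fg"], v["bg"], text_embedding) for v in views]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy 2-D embeddings: only the second view separates foreground from background.
text = [1.0, 0.0]
views = [
    {"fg": [0.6, 0.8], "bg": [0.6, 0.8]},  # foreground and background look alike
    {"fg": [1.0, 0.0], "bg": [0.0, 1.0]},  # foreground aligned, background orthogonal
    {"fg": [0.0, 1.0], "bg": [1.0, 0.0]},  # foreground and background swapped
]
best_index, best_score = select_best_view(views, text)
```

In this toy setup the second view is selected because its foreground aligns with the concept while its background does not. Whether such a score tracks mask quality on real medical images is exactly the premise questioned further down.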

If this is right

The selected augmented views provide denser, region-level supervision that improves over image-level uncertainty methods.
The long memory component prevents drift during continual one-pass adaptation across sequential test samples.
Prompt embeddings can be updated directly with the enhanced pseudo-labels produced by the combined short and long memories.
Performance gains appear on both prostate and skin lesion segmentation tasks.
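
The memory behaviour in the list above can be sketched as follows. This is an illustration, not the authors' implementation: the softmax weighting of recent prompts by score and the exponential moving average used for the long-term prompt are assumptions, as are the class name `LongShortPromptMemory` and the parameters `short_len`, `momentum`, and `temperature`.

```python
import math
from collections import deque


class LongShortPromptMemory:
    """Toy dual memory: a bounded queue of recent prompts plus one slow global prompt."""

    def __init__(self, short_len=4, momentum=0.9, temperature=1.0):
        self.short = deque(maxlen=short_len)  # recent (prompt, score) pairs
        self.long = None                      # global prompt, set on first update
        self.momentum = momentum              # assumed EMA factor for the long memory
        self.temperature = temperature        # assumed softmax temperature

    def update(self, prompt, score):
        """Store the newest prompt and move the long-term prompt slowly towards it."""
        prompt = list(prompt)
        self.short.append((prompt, score))
        if self.long is None:
            self.long = prompt
        else:
            m = self.momentum
            self.long = [m * old + (1.0 - m) * new for old, new in zip(self.long, prompt)]

    def short_prompt(self):
        """Score-weighted (softmax) average of the prompts in the short memory."""
        if not self.short:
            raise ValueError("short memory is empty")
        top = max(score for _, score in self.short)
        weights = [math.exp((score - top) / self.temperature) for _, score in self.short]
        total = sum(weights)
        dim = len(self.short[0][0])
        return [
            sum(w * prompt[i] for w, (prompt, _) in zip(weights, self.short)) / total
            for i in range(dim)
        ]


memory = LongShortPromptMemory(short_len=2, momentum=0.5)
for prompt in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    memory.update(prompt, score=0.0)

fused = memory.short_prompt()  # averages only the two most recent prompts
stable = memory.long           # still carries a trace of the first prompt
```

The bounded queue lets the short memory forget old samples quickly, while the slow average keeps the global prompt from following any single test image.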

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same consistency signal could be tested on other vision-language segmentation models facing domain shifts.
If medical textual descriptions are sparse or imprecise, the CAC metric may lose reliability and require alternative concept sources.
Applying the framework to additional modalities such as MRI or ultrasound would test whether the memory balance generalizes beyond the reported datasets.

Load-bearing premise

Textual-visual semantic consistency can reliably indicate region-level prediction correctness in medical images and therefore supply usable supervision.

What would settle it

If CAC-selected augmentations show no higher overlap with ground-truth masks than randomly chosen views on a held-out medical segmentation dataset, the quality-evaluation step would be shown to be ineffective.
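
One way to run this check is sketched below, assuming each held-out case provides per-view binary masks, per-view scores, and a ground-truth mask. The random baseline is computed as the mean Dice over all views, which is the expected Dice of a uniformly random pick. The function names and data layout are hypothetical.

```python
def dice(pred, target):
    """Dice coefficient between two flat binary masks given as lists of 0/1."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total


def selection_gap(cases):
    """Mean Dice of the top-scored view minus the expected Dice of a random view.

    Each case is a dict with 'views' (list of masks), 'scores' (one per view)
    and 'gt' (ground-truth mask). A gap near zero or below would indicate that
    the score does not help pick better views.
    """
    selected, baseline = [], []
    for case in cases:
        scores = case["scores"]
        best = max(range(len(scores)), key=scores.__getitem__)
        per_view = [dice(view, case["gt"]) for view in case["views"]]
        selected.append(per_view[best])
        baseline.append(sum(per_view) / len(per_view))
    n = len(cases)
    return sum(selected) / n - sum(baseline) / n


# Toy case with three candidate views of a four-pixel image.
cases = [
    {
        "gt": [1, 1, 0, 0],
        "views": [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1]],
        "scores": [0.8, 0.5, 0.1],
    }
]
gap = selection_gap(cases)
```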

Figures

Figures reproduced from arXiv: 2606.22963 by Guotai Wang, Jianghao Wu, Ping Ye, Shaoting Zhang, Yubo Zhou.

**Figure 1.** Overview of our CM-TTA for test-time adaptation of SAM3. [Text captured after the caption: "… and skin lesion segmentation tasks by an average Dice gain of 3.17 and 6.07 percentage points over the baseline. 2 Method. Previous work has shown that directly fine-tuning all parameters of foundation models is computationally expensive and can reduce the generalizability, while prompt tuning avoids distorting pre-trained representations and enables…"]

**Figure 2.** Qualitative comparison of different TTA methods on two datasets. [Text captured after the caption: "… baseline, meaning inference with SAM3 without adaptation. Method-specific hyperparameters of the compared methods were tuned for optimal results …"]

**Figure 3.** (a) Comparison of the quality of the best augmented view selected by different metrics on Promise12. (b) Effect of short memory bank length L on average Dice. (c) Effect of the weight of Lc on average Dice. [Text captured after the caption: "… 0.57s for the inference of one test sample on Promise12, which is close to the inference time using unadapted SAM3 directly (0.31s). The qualitative results shown in …"]

Original abstract

Concept segmentation models like Segment Anything Model 3 (SAM3) show strong generalization on natural images, yet their performance degrades in medical imaging due to the domain gap caused by different imaging principles and styles. Test-Time Adaptation (TTA) is essential for improving the testing performance by updating the model on the fly without annotations. However, existing vision-language TTA methods are mainly driven by image-level uncertainty minimization, which does not necessarily reflect region-level semantic correctness in medical segmentation. Moreover, they often lack mechanisms to maintain stability in continual one-pass adaptation, leading to limited performance when reliable dense supervision is missing for segmentation. To address these issues, we propose Concept Alignment Contrast and Long-Short Prompt Memory for Test-Time Adaptation (CM-TTA) of SAM3 for medical images. First, for a test sample with multiple augmentations, we introduce a novel Concept Alignment Contrast (CAC) metric, which leverages textual-visual semantic consistency to robustly evaluate prediction quality and select the best augmented view as the supervision. Second, to balance rapid and stable adaptation, we design a Long-Short Prompt Memory (LSPM) module. The short memory dynamically fuses recent prompts based on CAC scores for agile local adaptation, while the long memory maintains a stable global prompt to generate enhanced pseudo-labels. Finally, a Densely Supervised Prompt Update (DSPU) strategy is proposed to optimize the prompt embeddings with enhanced pseudo-labels as dense supervision. Extensive experiments on prostate and skin lesion segmentation demonstrate that our CM-TTA framework significantly outperforms existing methods for TTA of SAM3.
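
The last step of the abstract, updating only the prompt with per-pixel pseudo-labels, can be illustrated with a toy example. Everything specific here is an assumption made for the sketch: the decoder is reduced to a dot product between a frozen per-pixel feature and the prompt vector, the loss is binary cross-entropy, and the names `predict`, `bce`, and `prompt_update` are hypothetical. The abstract does not state which loss or decoder DSPU uses.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(prompt, features):
    """Per-pixel foreground probability from frozen features and a prompt vector."""
    return [sigmoid(sum(p * f for p, f in zip(prompt, feat))) for feat in features]


def bce(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over all pixels."""
    terms = [
        y * math.log(max(q, eps)) + (1 - y) * math.log(max(1.0 - q, eps))
        for q, y in zip(probs, labels)
    ]
    return -sum(terms) / len(terms)


def prompt_update(prompt, features, pseudo_labels, lr=0.5):
    """One gradient-descent step on the prompt only; the features stay frozen.

    For a dot-product decoder with BCE, d(loss)/d(prompt) is the mean over
    pixels of (probability - label) * feature.
    """
    probs = predict(prompt, features)
    n = len(features)
    grad = [
        sum((q - y) * feat[i] for q, y, feat in zip(probs, pseudo_labels, features)) / n
        for i in range(len(prompt))
    ]
    return [p - lr * g for p, g in zip(prompt, grad)]


# Four pixels with 2-D features; the pseudo-label marks the first two as foreground.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
pseudo_labels = [1, 1, 0, 0]
prompt = [0.0, 0.0]

loss_before = bce(predict(prompt, features), pseudo_labels)
new_prompt = prompt_update(prompt, features, pseudo_labels)
loss_after = bce(predict(new_prompt, features), pseudo_labels)
```

A single step per test sample matches the one-pass setting described in the abstract. The quality of the update depends entirely on the pseudo-labels, which is why the view selection and memory stages come first.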

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

CM-TTA proposes CAC for view selection plus LSPM for prompt stability in SAM3 TTA, but the abstract supplies no numbers, so the outperformance claim cannot be checked from the abstract alone.

The paper's main move is a one-pass TTA setup for SAM3 on medical scans that picks the strongest augmentation via Concept Alignment Contrast (text-image similarity on predicted regions) and then keeps adaptation steady with a long-short prompt memory that mixes recent and global prompts. DSPU then uses the selected view for dense prompt updates. This directly targets the gap that image-level uncertainty minimization often fails to give reliable region supervision in segmentation.

The combination of CAC selection with dual-memory prompt handling is the concrete addition over prior vision-language TTA work. It tries to turn semantic consistency into a practical signal for pseudo-label quality without needing annotations, which is a reasonable engineering step for continual medical adaptation.

The obvious soft spot is the absence of results in the abstract. It says the method significantly outperforms baselines on prostate and skin lesion tasks, yet gives no Dice scores, no dataset details, no baselines, no ablations, and no error bars. That makes it impossible to judge from the abstract whether CAC ranks region quality correctly once the domain shifts to MRI or dermoscopy. The load-bearing assumption is that embeddings from SAM3 and natural-image VLMs stay correlated with pixel-level correctness on medical images; if that correlation breaks down, the pseudo-labels become noisy and the claimed gains disappear.

This is for people already working on prompt-based adaptation of foundation models in medical imaging. A reader in that narrow area might pick up the memory design or the consistency metric as implementation ideas.

Send it for peer review so the experiments can be examined; the framework is coherent on paper but the evidence gap is large enough that referees need to see the numbers before any stronger judgment.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes the CM-TTA framework for test-time adaptation of SAM3 in medical image segmentation. It introduces Concept Alignment Contrast (CAC) to select the best augmented view via textual-visual semantic consistency, Long-Short Prompt Memory (LSPM) to fuse recent and global prompts for agile yet stable adaptation, and Densely Supervised Prompt Update (DSPU) to optimize prompt embeddings using enhanced pseudo-labels. The central claim is that extensive experiments on prostate and skin lesion segmentation demonstrate significant outperformance over existing TTA methods for SAM3.

Significance. If the empirical results hold, the framework offers a region-level supervision mechanism for TTA that addresses limitations of image-level uncertainty minimization in prior vision-language methods. This could improve adaptation of foundation models like SAM3 to medical domains without annotations. The work explicitly targets gaps in stability during continual one-pass adaptation.

major comments (2)

Abstract: the claim that 'extensive experiments on prostate and skin lesion segmentation demonstrate that our CM-TTA framework significantly outperforms existing methods' supplies no quantitative results, baselines, error bars, dataset details, or ablation studies, rendering the central empirical claim unverifiable from the provided text.
Method section describing CAC: the metric assumes cosine similarity between SAM3 region embeddings and medical concept text (e.g., 'prostate gland') robustly ranks region-level prediction quality. Because SAM3 and the underlying VLM were trained predominantly on natural images, this correlation may not hold under medical domain shift (MRI intensity patterns or dermoscopy texture), propagating noisy pseudo-labels into DSPU and undermining the outperformance claim. Direct validation (e.g., CAC vs. ground-truth Dice correlation on held-out data) is required.
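
The validation requested in the second comment could take the form of a rank correlation between per-view scores and per-view Dice against ground truth. The sketch below implements Spearman's rank correlation with the standard library only. The variable names and the toy numbers are made up for illustration and are not results from the paper.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Illustrative values only: scores that order the views exactly as Dice does.
view_scores = [0.10, 0.40, 0.90]
view_dice = [0.20, 0.50, 0.70]
rho = spearman(view_scores, view_dice)
```

A coefficient close to 1 on held-out medical data would support using the score for view selection, while a value near 0 would support the referee's concern.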

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the presentation and empirical support without altering the core contributions.

Referee: Abstract: the claim that 'extensive experiments on prostate and skin lesion segmentation demonstrate that our CM-TTA framework significantly outperforms existing methods' supplies no quantitative results, baselines, error bars, dataset details, or ablation studies, rendering the central empirical claim unverifiable from the provided text.

Authors: We agree that the abstract's empirical claim would benefit from added specificity for immediate verifiability. In the revised manuscript we will incorporate concise quantitative highlights (e.g., mean Dice improvements and standard deviations over the listed baselines on the two datasets) while preserving the abstract's brevity. revision: yes
Referee: Method section describing CAC: the metric assumes cosine similarity between SAM3 region embeddings and medical concept text (e.g., 'prostate gland') robustly ranks region-level prediction quality. Because SAM3 and the underlying VLM were trained predominantly on natural images, this correlation may not hold under medical domain shift (MRI intensity patterns or dermoscopy texture), propagating noisy pseudo-labels into DSPU and undermining the outperformance claim. Direct validation (e.g., CAC vs. ground-truth Dice correlation on held-out data) is required.

Authors: The concern about domain shift is well-taken and highlights a potential limitation of relying on cross-modal similarity. Our current experiments already show that CAC yields higher final segmentation accuracy than uncertainty baselines and that ablations removing CAC degrade performance; however, we do not presently report an explicit correlation plot of CAC scores versus ground-truth Dice on held-out data. We will add this analysis in the revision using the available validation splits to directly quantify the metric's reliability under the observed domain shift. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical TTA framework with independent components

The paper describes an empirical test-time adaptation method (CM-TTA) built from three proposed modules (CAC metric, LSPM, DSPU) that are motivated by stated gaps in prior TTA literature and validated via experiments on prostate and skin lesion data. No equations, derivations, or parameter-fitting steps are presented that reduce by construction to the inputs; the central claims rest on the assumption that textual-visual consistency can serve as supervision, which is an external modeling choice rather than a self-referential definition. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

Only the abstract is available, so the ledger is based on explicitly introduced components; no free parameters are named, but the method relies on unstated hyperparameters for memory fusion and contrast computation.

axioms (1)

domain assumption · Textual-visual semantic consistency reliably indicates region-level semantic correctness for medical image predictions.
This underpins the CAC metric as described in the abstract for selecting supervision views.

invented entities (3)

Concept Alignment Contrast (CAC) · no independent evidence
purpose: Metric leveraging textual-visual semantic consistency to evaluate and select augmented views as supervision.
Newly proposed component without external validation or prior citation mentioned.

Long-Short Prompt Memory (LSPM) · no independent evidence
purpose: Module with short memory for agile adaptation and long memory for stable global prompts.
Newly proposed module to balance adaptation speed and stability.

Densely Supervised Prompt Update (DSPU) · no independent evidence
purpose: Strategy to optimize prompt embeddings using enhanced pseudo-labels as dense supervision.
Newly proposed update strategy.
