Concept Alignment Contrast and Long-Short Prompt Memory for Test-Time Adaptation of SAM3 in Medical Image Segmentation
Pith reviewed 2026-06-26 08:53 UTC · model grok-4.3
The pith
CM-TTA adapts SAM3 to medical images by selecting reliable augmentations through textual-visual consistency and using dual prompt memory for stable updates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that textual-visual semantic consistency, measured by the Concept Alignment Contrast (CAC) metric, can identify high-quality region predictions among multiple augmentations of a test image. Combined with a Long-Short Prompt Memory (LSPM) that fuses recent prompts dynamically while preserving a stable global prompt, these selections supply enhanced pseudo-labels. The pseudo-labels serve as dense supervision for one-pass prompt optimization, which lets SAM3 adapt to medical domains without annotations.
What carries the argument
The Concept Alignment Contrast (CAC) metric, which uses textual-visual semantic consistency to rank prediction quality, and the Long-Short Prompt Memory (LSPM) module, which maintains both short-term dynamic prompts and a long-term stable prompt for pseudo-label generation.
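To make the selection step concrete, here is a minimal sketch of ranking augmented views by a text-region alignment score. The scoring formula (cosine similarity of the predicted region to the concept text, minus that of the background), the function names, and the toy embeddings are all assumptions for illustration; the exact CAC definition is not given on this page.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def alignment_contrast(fg_embedding, bg_embedding, text_embedding):
    """Hypothetical score: how much better the predicted region matches
    the concept text than the background does."""
    return cosine(fg_embedding, text_embedding) - cosine(bg_embedding, text_embedding)

def select_best_view(views, text_embedding):
    """Return the index of the highest-scoring view and all scores."""
    scores = [alignment_contrast(v["fg"], v["bg"], text_embedding) for v in views]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy embeddings (assumed): one text concept and three augmented views.
text = [1.0, 0.0, 0.0]
views = [
    {"fg": [0.2, 0.9, 0.1], "bg": [0.1, 0.8, 0.3]},
    {"fg": [0.9, 0.1, 0.0], "bg": [0.0, 1.0, 0.2]},
    {"fg": [0.5, 0.5, 0.0], "bg": [0.4, 0.6, 0.0]},
]
best_index, scores = select_best_view(views, text)
print(best_index, [round(s, 3) for s in scores])
```

In this toy setup the second view wins because its predicted region aligns with the text while its background does not. A real pipeline would take the embeddings from the model's image and text encoders.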
If this is right
- The selected augmented views provide denser, region-level supervision that improves over image-level uncertainty methods.
- The long memory component prevents drift during continual one-pass adaptation across sequential test samples.
- Prompt embeddings can be updated directly with the enhanced pseudo-labels produced by the combined short and long memories.
- Performance gains appear on both prostate and skin lesion segmentation tasks.
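The interplay of the short and long memories listed above can be sketched as follows. This is an illustrative reading, not the paper's implementation: the softmax weighting of recent prompts by score, the exponential moving average for the long memory, and every name and default value are assumptions.

```python
import math
from collections import deque

class LongShortPromptMemory:
    """Toy memory: a short buffer of recent prompts and a slow global average."""

    def __init__(self, short_size=3, momentum=0.9, temperature=1.0):
        self.short = deque(maxlen=short_size)  # recent (prompt, score) pairs
        self.long = None                       # stable global prompt
        self.momentum = momentum
        self.temperature = temperature

    def update(self, prompt, score):
        """Store the newest prompt and move the long memory slightly toward it."""
        prompt = list(prompt)
        self.short.append((prompt, score))
        if self.long is None:
            self.long = prompt
        else:
            m = self.momentum
            self.long = [m * l + (1.0 - m) * p for l, p in zip(self.long, prompt)]

    def short_prompt(self):
        """Fuse recent prompts with softmax weights over their scores."""
        if not self.short:
            raise ValueError("short memory is empty")
        top = max(score for _, score in self.short)
        weights = [math.exp((score - top) / self.temperature) for _, score in self.short]
        total = sum(weights)
        dim = len(self.short[0][0])
        return [
            sum(w * prompt[i] for w, (prompt, _) in zip(weights, self.short)) / total
            for i in range(dim)
        ]

memory = LongShortPromptMemory(short_size=2, momentum=0.5)
for prompt in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    memory.update(prompt, score=0.0)  # equal scores give equal fusion weights

fused_short = memory.short_prompt()
stable_long = memory.long
print(fused_short, stable_long)
```

The short buffer forgets the oldest prompt once it is full, so it tracks local changes quickly, while the averaged long prompt changes slowly and is less exposed to a single bad sample.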
Where Pith is reading between the lines
- The same consistency signal could be tested on other vision-language segmentation models facing domain shifts.
- If medical textual descriptions are sparse or imprecise, the CAC metric may lose reliability and require alternative concept sources.
- Applying the framework to additional modalities such as MRI or ultrasound would test whether the memory balance generalizes beyond the reported datasets.
Load-bearing premise
Textual-visual semantic consistency can reliably indicate region-level prediction correctness in medical images and therefore supply usable supervision.
What would settle it
If CAC-selected augmentations show no higher overlap with ground-truth masks than randomly chosen views on a held-out medical segmentation dataset, the quality-evaluation step would be shown to be ineffective.
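A check of this kind could be scripted roughly as below. The sketch compares the Dice score of the highest-scoring view against the expected Dice of a uniformly random view; the data layout, the function names, and the single toy case are assumptions, and the numbers are illustrative rather than results from the paper.

```python
def dice(pred, target):
    """Dice overlap between two flat binary masks (lists of 0/1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 2.0 * intersection / total if total else 1.0

def selection_gain(cases):
    """Mean Dice advantage of the top-scored view over a random view.

    Each case is a dict with 'gt' (mask), 'views' (list of masks), and
    'cac' (one score per view). A gain near zero or below would suggest
    the score does not help selection.
    """
    gains = []
    for case in cases:
        dices = [dice(view, case["gt"]) for view in case["views"]]
        chosen = max(range(len(dices)), key=case["cac"].__getitem__)
        random_expectation = sum(dices) / len(dices)
        gains.append(dices[chosen] - random_expectation)
    return sum(gains) / len(gains)

# One made-up case with three augmented predictions.
cases = [
    {
        "gt": [1, 1, 0, 0],
        "views": [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1]],
        "cac": [0.9, 0.4, 0.1],
    }
]
gain = selection_gain(cases)
print(round(gain, 4))
```

On a real held-out set the gain would be averaged over many cases and paired with a significance test before drawing any conclusion.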
Original abstract
Concept segmentation models like Segment Anything Model 3 (SAM3) show strong generalization on natural images, yet their performance degrades in medical imaging due to the domain gap caused by different imaging principles and styles. Test-Time Adaptation (TTA) is essential for improving the testing performance by updating the model on the fly without annotations. However, existing vision-language TTA methods are mainly driven by image-level uncertainty minimization, which does not necessarily reflect region-level semantic correctness in medical segmentation. Moreover, they often lack mechanisms to maintain stability in continual one-pass adaptation, leading to limited performance when reliable dense supervision is missing for segmentation. To address these issues, we propose Concept Alignment Contrast and Long-Short Prompt Memory for Test-Time Adaptation (CM-TTA) of SAM3 for medical images. First, for a test sample with multiple augmentations, we introduce a novel Concept Alignment Contrast (CAC) metric, which leverages textual-visual semantic consistency to robustly evaluate prediction quality to select the best augmented view as the supervision. Second, to balance rapid and stable adaptation, we design a Long-Short Prompt Memory (LSPM) module. The short memory dynamically fuses recent prompts based on CAC scores for agile local adaptation, while the long memory maintains a stable global prompt to generate enhanced pseudo-labels. Finally, a Densely Supervised Prompt Update (DSPU) strategy is proposed to optimize the prompt embeddings with enhanced pseudo-labels as dense supervision. Extensive experiments on prostate and skin lesion segmentation demonstrate that our CM-TTA framework significantly outperforms existing methods for TTA of SAM3.
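The idea of updating only the prompt embedding under dense pseudo-label supervision can be pictured with a toy example. The sketch below is a stand-in under strong assumptions: the "decoder" is a plain dot product between per-pixel features and the prompt, the loss is binary cross-entropy, and all names and values are hypothetical. SAM3's actual decoder and the paper's loss are not described on this page.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logits(prompt, features):
    """Toy decoder: one logit per pixel from a feature-prompt dot product."""
    return [sum(a * b for a, b in zip(f, prompt)) for f in features]

def bce_loss(prompt, features, pseudo_label):
    """Mean binary cross-entropy against the dense pseudo-label."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits(prompt, features), pseudo_label):
        p = sigmoid(z)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(features)

def prompt_update(prompt, features, pseudo_label, lr=0.5):
    """One gradient step on the prompt only; the features stay frozen."""
    grad = [0.0] * len(prompt)
    for f, z, y in zip(features, logits(prompt, features), pseudo_label):
        error = sigmoid(z) - y  # derivative of BCE with respect to the logit
        for i, a in enumerate(f):
            grad[i] += error * a / len(features)
    return [w - lr * g for w, g in zip(prompt, grad)]

# Four "pixels" with 2-D features and a pseudo-label for each (assumed values).
features = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
pseudo_label = [1, 1, 0, 0]
prompt = [0.0, 0.0]

loss_before = bce_loss(prompt, features, pseudo_label)
new_prompt = prompt_update(prompt, features, pseudo_label)
loss_after = bce_loss(new_prompt, features, pseudo_label)
print(round(loss_before, 4), round(loss_after, 4), new_prompt)
```

Because every pixel contributes a gradient term, the supervision is dense, in contrast to a single image-level uncertainty signal.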
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes the CM-TTA framework for test-time adaptation of SAM3 in medical image segmentation. It introduces Concept Alignment Contrast (CAC) to select the best augmented view via textual-visual semantic consistency, Long-Short Prompt Memory (LSPM) to fuse recent and global prompts for agile yet stable adaptation, and Densely Supervised Prompt Update (DSPU) to optimize prompt embeddings using enhanced pseudo-labels. The central claim is that extensive experiments on prostate and skin lesion segmentation demonstrate significant outperformance over existing TTA methods for SAM3.
Significance. If the empirical results hold, the framework offers a region-level supervision mechanism for TTA that addresses limitations of image-level uncertainty minimization in prior vision-language methods. This could improve adaptation of foundation models like SAM3 to medical domains without annotations. The work explicitly targets gaps in stability during continual one-pass adaptation.
major comments (2)
- Abstract: the claim that 'extensive experiments on prostate and skin lesion segmentation demonstrate that our CM-TTA framework significantly outperforms existing methods' supplies no quantitative results, baselines, error bars, dataset details, or ablation studies, rendering the central empirical claim unverifiable from the provided text.
- Method section describing CAC: the metric assumes cosine similarity between SAM3 region embeddings and medical concept text (e.g., 'prostate gland') robustly ranks region-level prediction quality. Because SAM3 and the underlying VLM were trained predominantly on natural images, this correlation may not hold under medical domain shift (MRI intensity patterns or dermoscopy texture), propagating noisy pseudo-labels into DSPU and undermining the outperformance claim. Direct validation (e.g., CAC vs. ground-truth Dice correlation on held-out data) is required.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the presentation and empirical support without altering the core contributions.
- Referee: Abstract: the claim that 'extensive experiments on prostate and skin lesion segmentation demonstrate that our CM-TTA framework significantly outperforms existing methods' supplies no quantitative results, baselines, error bars, dataset details, or ablation studies, rendering the central empirical claim unverifiable from the provided text.
  Authors: We agree that the abstract's empirical claim would benefit from added specificity for immediate verifiability. In the revised manuscript we will incorporate concise quantitative highlights (e.g., mean Dice improvements and standard deviations over the listed baselines on the two datasets) while preserving the abstract's brevity.
  Revision: yes
- Referee: Method section describing CAC: the metric assumes cosine similarity between SAM3 region embeddings and medical concept text (e.g., 'prostate gland') robustly ranks region-level prediction quality. Because SAM3 and the underlying VLM were trained predominantly on natural images, this correlation may not hold under medical domain shift (MRI intensity patterns or dermoscopy texture), propagating noisy pseudo-labels into DSPU and undermining the outperformance claim. Direct validation (e.g., CAC vs. ground-truth Dice correlation on held-out data) is required.
  Authors: The concern about domain shift is well-taken and highlights a potential limitation of relying on cross-modal similarity. Our current experiments already show that CAC yields higher final segmentation accuracy than uncertainty baselines and that ablations removing CAC degrade performance; however, we do not presently report an explicit correlation plot of CAC scores versus ground-truth Dice on held-out data. We will add this analysis in the revision using the available validation splits to directly quantify the metric's reliability under the observed domain shift.
  Revision: yes
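The validation requested here, a correlation between CAC scores and ground-truth Dice, could be computed along the following lines. The sketch uses a Spearman rank correlation written with the standard library only; the choice of Spearman over Pearson and the four sample values are assumptions made for illustration, not numbers from the paper.

```python
import math

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y) if var_x > 0 and var_y > 0 else 0.0

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up per-sample values: a label-free score and the Dice it should predict.
cac_scores = [0.10, 0.35, 0.50, 0.80]
dice_scores = [0.42, 0.61, 0.58, 0.90]
rho = spearman(cac_scores, dice_scores)
print(round(rho, 3))
```

A rank correlation fits the use case because CAC is only used to order candidate views, so the ordering matters more than a linear relationship with Dice.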
Circularity Check
No circularity: empirical TTA framework with independent components
The paper describes an empirical test-time adaptation method (CM-TTA) built from three proposed modules (CAC metric, LSPM, DSPU) that are motivated by stated gaps in prior TTA literature and validated via experiments on prostate and skin lesion data. No equations, derivations, or parameter-fitting steps are presented that reduce by construction to the inputs; the central claims rest on the assumption that textual-visual consistency can serve as supervision, which is an external modeling choice rather than a self-referential definition. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Textual-visual semantic consistency reliably indicates region-level semantic correctness for medical image predictions
invented entities (3)
- Concept Alignment Contrast (CAC): no independent evidence
- Long-Short Prompt Memory (LSPM): no independent evidence
- Densely Supervised Prompt Update (DSPU): no independent evidence