pith. machine review for the scientific record.

arxiv: 2604.24719 · v1 · submitted 2026-04-27 · 💻 cs.CV

Recognition: unknown

DiffuSAM: Diffusion-Based Prompt-Free SAM2 for Few-Shot and Source-Free Medical Image Segmentation

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 04:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion models · SAM2 adaptation · prompt-free segmentation · medical image segmentation · source-free domain adaptation · few-shot learning · CT MRI segmentation

The pith

A lightweight diffusion prior generates SAM2 mask embeddings from frozen features for prompt-free medical segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that SAM2, trained on natural images, can be adapted to medical CT and MRI scans without prompts, source data, or backbone fine-tuning. It does so by training a small diffusion model to produce segmentation mask-like embeddings directly from the frozen SAM2 image encoder outputs. These embeddings feed into SAM2's existing mask decoder, and the diffusion process is conditioned on previously segmented slices to maintain volume-wide consistency. This yields competitive results on standard benchmarks under source-free unsupervised domain adaptation and few-shot regimes while keeping training and inference lightweight.

Core claim

DiffuSAM synthesizes SAM2-compatible segmentation mask-like embeddings via a lightweight diffusion prior conditioned on off-the-shelf frozen SAM2 image features and prior slices, then integrates those embeddings into SAM2's mask decoder to produce prompt-free segmentations on medical volumes.
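At a high level, the claimed pipeline is: frozen SAM2 image features in, diffusion-sampled mask embedding out, frozen mask decoder last. A minimal numpy sketch of that data flow follows; all dimensions, weights, and the update rule are invented stand-ins, since the paper's actual prior network, noise schedule, and embedding sizes are not given in the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the paper does not state its embedding dimensions here.
FEAT_DIM, EMB_DIM, STEPS = 256, 64, 50

# Random linear maps stand in for the learned prior and the frozen decoder,
# purely to show the data flow, not any real architecture.
W_prior = rng.standard_normal((FEAT_DIM + 2 * EMB_DIM, EMB_DIM)) * 0.01
W_decoder = rng.standard_normal((EMB_DIM, 2)) * 0.01   # toy 2-class mask head

def denoise_step(x_t, img_feat, prev_emb):
    """One reverse step of the stand-in prior: predict noise, move against it."""
    cond = np.concatenate([img_feat, prev_emb, x_t])
    eps_hat = cond @ W_prior
    return x_t - 0.1 * eps_hat        # simplified update, not a real DDPM schedule

def generate_mask_embedding(img_feat, prev_emb):
    """Sample a mask-like embedding from noise, conditioned on frozen image
    features and the previous slice's embedding -- no user prompt involved."""
    x = rng.standard_normal(EMB_DIM)  # start from Gaussian noise
    for _ in range(STEPS):
        x = denoise_step(x, img_feat, prev_emb)
    return x

img_feat = rng.standard_normal(FEAT_DIM)   # frozen SAM2 encoder output (stand-in)
prev_emb = np.zeros(EMB_DIM)               # first slice: no history yet
emb = generate_mask_embedding(img_feat, prev_emb)
logits = emb @ W_decoder                   # frozen mask decoder (stand-in)
print(emb.shape, logits.shape)
```

The point of the sketch is only the wiring: nothing trainable in SAM2 is touched, the prompt pathway is replaced by the sampled embedding, and the previous slice's embedding enters solely as conditioning.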

What carries the argument

Lightweight diffusion prior that synthesizes segmentation mask-like embeddings from frozen SAM2 image features, conditioned on previous slices for spatial consistency.

If this is right

  • Removes the requirement for expert-designed prompts during inference on new medical volumes.
  • Supports source-free unsupervised domain adaptation without access to the original training data.
  • Enables few-shot adaptation with limited labeled target examples while keeping the SAM2 backbone frozen.
  • Maintains spatial consistency across 3D volumes by conditioning each slice on the preceding segmentation output.
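The last bullet, conditioning each slice on the preceding output, amounts to a simple recurrence at inference time. A hedged sketch of that loop (the sampler here is a toy deterministic stand-in for the diffusion prior; all names and shapes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
N_SLICES, FEAT_DIM, EMB_DIM = 8, 256, 64

def prior_sample(img_feat, prev_emb):
    """Deterministic toy stand-in for sampling from the diffusion prior."""
    return 0.5 * prev_emb + 0.1 * img_feat[:EMB_DIM]

def segment_volume(volume_feats):
    """Slice-by-slice inference: each slice's mask embedding is conditioned
    on the embedding produced for the previous slice."""
    prev = np.zeros(EMB_DIM)           # no history for the first slice
    embeddings = []
    for feat in volume_feats:
        prev = prior_sample(feat, prev)
        embeddings.append(prev)
    return np.stack(embeddings)

feats = rng.standard_normal((N_SLICES, FEAT_DIM))  # frozen per-slice features
embs = segment_volume(feats)
print(embs.shape)                                  # (8, 64)
```

The recurrence is what makes the pipeline fully automatic once started: only the first slice lacks conditioning, which is why the editorial note below about seed slices follows naturally.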

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same embedding-synthesis approach could transfer to other promptable vision models beyond SAM2.
  • Reducing prompt dependence may lower the annotation burden in clinical workflows where expert time is scarce.
  • The slice-conditioning mechanism suggests a route to fully automatic 3D segmentation pipelines once a few seed slices are obtained.

Load-bearing premise

A small diffusion model trained only on natural-image SAM2 features can still produce accurate, SAM2-decoder-compatible mask embeddings when applied to medical images without any domain-specific changes to the backbone or decoder.
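One way this premise could be probed without touching the backbone is a feature-space similarity check: if per-organ mean features computed in one domain still align best with the same organ's features in another, the frozen encoder retains usable semantic structure across the gap. A toy illustration with synthetic vectors (the data, dimensions, and the 0.3 domain-shift factor are all invented):

```python
import numpy as np

rng = np.random.default_rng(3)

def cosine_matrix(X, Y):
    """Pairwise cosine similarities between rows of X and rows of Y."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T

# Invented stand-ins: per-organ mean embeddings in a "source" domain and a
# shifted "target" domain (0.3 is an arbitrary shift magnitude).
src_means = rng.standard_normal((4, 256))
tgt_means = src_means + 0.3 * rng.standard_normal((4, 256))
S = cosine_matrix(src_means, tgt_means)
# If semantic structure survives the shift, each row's largest entry should
# sit on the diagonal (each organ matches itself across domains).
print(S.shape)
```

A real version of this probe would use actual SAM2 features pooled over annotated organ regions on natural-image-like and medical inputs; the sketch only shows the shape of the argument.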

What would settle it

Running the method on a held-out medical dataset would settle it: if the generated embeddings yield segmentations whose Dice scores fall substantially below a supervised baseline, or show visible spatial inconsistencies across slices, the load-bearing premise fails; if they are competitive and consistent, it stands.

Figures

Figures reproduced from arXiv: 2604.24719 by Hayit Greenspan, Lev Ayzenberg, Noa Cahan, Tal Grossman.

Figure 1
Figure 1: An outline of our proposed method. (a) In our proposed network, we add a diffusion prior (in Green) to generate memory embeddings from the image embeddings, thereby removing the need for a user prompt (in Red). During training, the SAM2 components all remain frozen. (b) Diffusion Prior model. During training, a diffusion model takes a memory embedding to which noise has been added, the matching image e…
Original abstract

Segmentation models such as Segment Anything Model (SAM) and SAM2 achieve strong prompt-driven zero-shot performance. However, their training on natural images limits domain transfer to medical data. Consequently, accurate segmentation typically requires extensive fine-tuning and expert-designed prompts. We propose DiffuSAM, a diffusion-based adaptation of SAM2 for prompt-free medical image segmentation. Our framework synthesizes SAM2-compatible segmentation mask-like embeddings via a lightweight diffusion-prior from off-the-shelf frozen SAM2 image features. The generated embeddings are integrated into SAM2's mask decoder to produce accurate segmentations, thereby eliminating the need for user prompts. The diffusion prior is further conditioned on previously segmented slices, enforcing spatial consistency across volumes. Evaluated on the BTCV and CHAOS datasets for CT and MRI under Source-Free Unsupervised Domain Adaptation (SF-UDA) and Few-Shot settings, DiffuSAM achieves competitive performance with efficient training and inference. Code is available upon request from the corresponding author.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 1 minor

Summary. The paper proposes DiffuSAM, a prompt-free adaptation of SAM2 for medical image segmentation. It trains a lightweight diffusion prior on frozen SAM2 image features to synthesize mask-like embeddings, which are fed to the frozen SAM2 mask decoder. The prior is conditioned on embeddings from previously segmented slices to enforce 3D consistency. The method is positioned for source-free unsupervised domain adaptation (SF-UDA) and few-shot regimes and is evaluated on the BTCV and CHAOS datasets for CT and MRI, with the abstract claiming competitive performance and efficient training/inference.

Significance. If the results hold, the work would demonstrate a practical route to deploy natural-image foundation models such as SAM2 on medical volumes without backbone fine-tuning or manual prompts, which are major practical bottlenecks. The diffusion-prior idea for generating decoder-compatible embeddings is a potentially reusable technique for other prompt-based segmenters.

major comments (4)
  1. Abstract: the claim of 'competitive performance' on BTCV and CHAOS under SF-UDA and few-shot settings is unsupported by any numerical results, error bars, baseline comparisons, or metric values, rendering the central empirical claim unverifiable from the manuscript.
  2. Framework description (throughout): no equations, loss functions, or architectural details are provided for the lightweight diffusion prior, its conditioning on prior-slice embeddings, or the training procedure, so it is impossible to assess whether the generated embeddings are actually SAM2-compatible or merely fitted to the training distribution.
  3. Method and evaluation sections: the load-bearing assumption that off-the-shelf frozen SAM2 features (trained exclusively on natural images) contain sufficient semantic structure for accurate medical anatomy segmentation is never tested via ablation (e.g., frozen vs. fine-tuned encoder, or feature-space analysis), despite the acknowledged domain shift in intensity, noise, and lack of color cues.
  4. Evaluation setup: no information is given on data splits, number of shots, training/validation protocol for the diffusion prior, or inference-time 3D consistency enforcement, all of which are required to substantiate the 'efficient training and inference' and cross-setting claims.
minor comments (1)
  1. The statement 'Code is available upon request' should be replaced by a public repository link or DOI to meet standard reproducibility expectations.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address all major concerns by adding missing details, quantitative support, and clarifications. Our point-by-point responses follow.

Point-by-point responses
  1. Referee: Abstract: the claim of 'competitive performance' on BTCV and CHAOS under SF-UDA and few-shot settings is unsupported by any numerical results, error bars, baseline comparisons, or metric values, rendering the central empirical claim unverifiable from the manuscript.

    Authors: We agree that the abstract requires concrete numerical backing to substantiate the claim. In the revised manuscript, we have updated the abstract to include specific Dice and HD95 scores on BTCV and CHAOS for both SF-UDA and few-shot settings, along with comparisons to relevant baselines and reference to error bars from repeated runs. revision: yes

  2. Referee: Framework description (throughout): no equations, loss functions, or architectural details are provided for the lightweight diffusion prior, its conditioning on prior-slice embeddings, or the training procedure, so it is impossible to assess whether the generated embeddings are actually SAM2-compatible or merely fitted to the training distribution.

    Authors: We acknowledge the absence of these details in the original submission. The revised Method section now provides the full diffusion equations, the training loss (denoising score matching objective), the network architecture of the lightweight prior, and the conditioning mechanism that injects prior-slice embeddings. These additions clarify how the synthesized embeddings remain compatible with the frozen SAM2 mask decoder. revision: yes

  3. Referee: Method and evaluation sections: the load-bearing assumption that off-the-shelf frozen SAM2 features (trained exclusively on natural images) contain sufficient semantic structure for accurate medical anatomy segmentation is never tested via ablation (e.g., frozen vs. fine-tuned encoder, or feature-space analysis), despite the acknowledged domain shift in intensity, noise, and lack of color cues.

    Authors: This concern is valid. We have added an ablation study comparing frozen versus fine-tuned SAM2 image encoder performance under identical SF-UDA and few-shot protocols. We also include a brief t-SNE visualization and cosine similarity analysis of feature embeddings to demonstrate that semantic structure for anatomical regions is preserved despite the domain gap, supporting the source-free design. revision: yes

  4. Referee: Evaluation setup: no information is given on data splits, number of shots, training/validation protocol for the diffusion prior, or inference-time 3D consistency enforcement, all of which are required to substantiate the 'efficient training and inference' and cross-setting claims.

    Authors: We have substantially expanded the Experimental Setup and Implementation Details sections. The revision now specifies the exact train/validation/test splits for BTCV and CHAOS, the shot counts (e.g., 1-shot, 5-shot), the training/validation protocol and hyperparameters for the diffusion prior, and the slice-by-slice inference procedure with temporal conditioning for 3D consistency. Runtime tables are also added to support the efficiency claims. revision: yes
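The rebuttal names a denoising score-matching objective for training the prior. The simple-epsilon DDPM form of that objective can be sketched as follows; the schedule, step count, and embedding size are placeholders, not the paper's values, and the callable model is a trivial stand-in.

```python
import numpy as np

rng = np.random.default_rng(1)
EMB_DIM, T = 64, 1000

# Standard DDPM bookkeeping: linear beta schedule and cumulative alpha products.
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def ddpm_loss(eps_model, x0, cond, t):
    """Simple-epsilon objective: corrupt x0 to step t, ask the model to
    recover the injected noise, score it with mean squared error."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    eps_hat = eps_model(x_t, cond, t)
    return float(np.mean((eps_hat - eps) ** 2))

# Toy check with a "model" that always predicts zero noise; cond stands in
# for the frozen image features plus the prior-slice embedding.
x0 = rng.standard_normal(EMB_DIM)     # a clean mask-like embedding (stand-in)
cond = rng.standard_normal(EMB_DIM)
loss = ddpm_loss(lambda x_t, c, t: np.zeros_like(x_t), x0, cond, t=500)
print(loss > 0.0)
```

In the paper's setting, `x0` would be a target mask embedding and `cond` the conditioning inputs; everything else about the real objective is unstated in the text shown here.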

Circularity Check

0 steps flagged

No circularity; purely descriptive framework with no derivations or equations

Full rationale

The provided manuscript text contains only high-level framework descriptions and no equations, derivations, fitted parameters presented as predictions, or self-citation chains that reduce claims to inputs by construction. The approach of training a lightweight diffusion prior on frozen SAM2 features is stated as a design choice without mathematical reduction or uniqueness theorems invoked from prior self-work. This matches the reader's assessment that no performance claims reduce to fitted inputs by construction, qualifying as a self-contained proposal rather than a circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unproven assumption that SAM2 image features remain sufficiently informative for medical data when passed through a learned diffusion mapping.

axioms (1)
  • domain assumption: the SAM2 image encoder produces features that can be mapped to mask embeddings via diffusion for out-of-domain medical images
    Invoked in the description of the diffusion prior synthesis step.
invented entities (1)
  • lightweight diffusion-prior (no independent evidence)
    purpose: Synthesize SAM2-compatible segmentation mask-like embeddings from frozen SAM2 features
    New component introduced to eliminate prompts; no independent evidence of its generalizability provided in abstract.

pith-pipeline@v0.9.0 · 5478 in / 1200 out tokens · 41401 ms · 2026-05-08T04:16:04.826856+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

18 extracted references · 4 canonical work pages · 3 internal anchors

  1. [1]

    Deep learning has achieved remarkable success in this task [1], but supervised approaches require large annotated datasets, which are costly and expert-dependent

    INTRODUCTION Medical image segmentation is essential for clinical diagnosis and treatment planning. Deep learning has achieved remarkable success in this task [1], but supervised approaches require large annotated datasets, which are costly and expert-dependent. Foundation models such as SAM [2] and SAM2

  2. [2]

    In addition, their reliance on user prompts limits practical use in clinical workflows

    exhibit strong zero-shot segmentation on natural images but face challenges when applied to medical data due to modality differences in texture, contrast, and anatomy. In addition, their reliance on user prompts limits practical use in clinical workflows. Previous studies have adapted SAM…

  3. [3]

    METHOD An overview of DiffuSAM is shown in Fig. 1 (a). Our model was inspired by [8], in which a diffusion-prior was trained for cross-modal embedding generation for classification enhancement. During training, we fit a lightweight diffusion…

  4. [4]

    Dataset and setup We evaluate our method on abdominal imaging datasets

    EXPERIMENTS 3.1. Dataset and setup We evaluate our method on abdominal imaging datasets. The Beyond the Cranial Vault (BTCV) dataset [9] consists of 30 CT volumes, while the CHAOS dataset [10] comprises 20 T2-SPIR MRI volumes. Both datasets provide multi-organ annotations; specifically, we focus on segmenting the spleen, right kidney, left kidney, and ...

  5. [5]

    We acknowledge that MedSAM [4], trained end-to-end on large medical datasets and optimized for prompts, can outperform ours

    DISCUSSION AND CONCLUSION We showed that a lightweight diffusion prior over frozen SAM2 features enables prompt-free segmentation with competitive results in few-shot and SF-UDA, with simple training and fast inference. We acknowledge that MedSAM [4], trained end-to-end on large medical datasets and optimized for prompts, can outperform ours. This is ex...

  6. [6]

    U-net: Convolutional networks for biomedical image segmentation

    O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015, vol. 9351 of LNCS, pp. 234–241, Springer

  7. [7]

    Segment Anything

    A. Kirillov, E. Mintun, N. Ravi, et al., “Segment anything,” arXiv:2304.02643, 2023

  8. [8]

    SAM 2: Segment Anything in Images and Videos

    N. Ravi, V. Gabeur, Y.T. Hu, et al., “SAM 2: Segment anything in images and videos,” arXiv preprint arXiv:2408.00714, 2024

  9. [9]

    Segment anything in medical images

    J. Ma, Y. He, F. Li, et al., “Segment anything in medical images,” Nature Communications, vol. 15, p. 654, 2024

  10. [10]

    ProtoSAM: One-shot medical image segmentation with foundational models

    L. Ayzenberg, R. Giryes, and H. Greenspan, “ProtoSAM: One-shot medical image segmentation with foundational models,” arXiv preprint arXiv:2407.07042, 2024

  11. [11]

    FDiff-Fusion: Denoising diffusion fusion network based on fuzzy learning for 3D medical image segmentation

    Z. Ding, Q. Geng, S. Wang, et al., “FDiff-Fusion: Denoising diffusion fusion network based on fuzzy learning for 3D medical image segmentation,” Information Fusion, vol. 112, p. 102540, 2024

  12. [12]

    Leveraging segment anything model for source-free domain adaptation via dual feature guided auto-prompting

    Z. Huai, H. Tang, Y. Li, et al., “Leveraging segment anything model for source-free domain adaptation via dual feature guided auto-prompting,” IEEE Transactions on Medical Imaging, pp. 1–1, 2025

  13. [13]

    Cross-modal CXR-CTPA knowledge distillation using latent diffusion priors towards CXR pulmonary embolism diagnosis

    N. Cahan, M. Sizikov, and H. Greenspan, “Cross-modal CXR-CTPA knowledge distillation using latent diffusion priors towards CXR pulmonary embolism diagnosis,” in Medical Image Computing and Computer Assisted Intervention – MICCAI 2025, Proceedings, Part XV, 2025, pp. 125–135, Springer-Verlag

  14. [14]

    MICCAI multi-atlas labeling beyond the cranial vault – workshop and challenge

    B. Landman, Z. Xu, J. Iglesias, et al., “MICCAI multi-atlas labeling beyond the cranial vault – workshop and challenge,” in Proc. MICCAI Multi-Atlas Labeling Beyond the Cranial Vault Workshop Challenge, 2015, vol. 5, p. 12

  15. [15]

    CHAOS Challenge – combined (CT-MR) healthy abdominal organ segmentation

    A.E. Kavur, N.S. Gezer, M. Baris, et al., “CHAOS Challenge – combined (CT-MR) healthy abdominal organ segmentation,” Medical Image Analysis, vol. 69, p. 101950, Apr. 2021

  16. [16]

    UNETR: Transformers for 3D medical image segmentation

    A. Hatamizadeh, Y. Tang, V. Nath, et al., “UNETR: Transformers for 3D medical image segmentation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 574–584

  17. [17]

    Few-shot adaptation of training-free foundation model for 3D medical image segmentation

    X. He, Y. Hu, Z. Zhou, et al., “Few-shot adaptation of training-free foundation model for 3D medical image segmentation,” 2025

  18. [18]

    Source-free domain adaptive fundus image segmentation with denoised pseudo-labeling

    C. Chen, Q. Liu, Y. Jin, et al., “Source-free domain adaptive fundus image segmentation with denoised pseudo-labeling,” in Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, Cham, 2021, pp. 225–235, Springer International Publishing