ENSAM: an efficient foundation model for interactive segmentation of 3D medical images

Agnar Martin Bjørnstad; Arian Ranjbar; Elias Stenhede

arXiv: 2509.15874 · v1 · pith:26METU7Hnew · submitted 2025-09-19 · cs.CV




keywords: ENSAM · segmentation · final · medical · model · challenge · encoder · foundation



We present ENSAM (Equivariant, Normalized, Segment Anything Model), a lightweight, promptable model for universal 3D medical image segmentation. ENSAM combines a SegResNet-based encoder with a prompt encoder and mask decoder in a U-Net-style architecture; it relies on latent cross-attention, relative positional encoding, and normalized attention, and it is trained with the Muon optimizer. ENSAM is designed to perform well under limited data and compute budgets: it is trained from scratch on fewer than 5,000 volumes spanning multiple modalities (CT, MRI, PET, ultrasound, microscopy) on a single 32 GB GPU in 6 hours. As part of the CVPR 2025 Foundation Models for Interactive 3D Biomedical Image Segmentation Challenge, ENSAM was evaluated on a hidden test set of multimodal 3D medical images, obtaining a DSC AUC of 2.404, an NSD AUC of 2.266, a final DSC of 0.627, and a final NSD of 0.597. It outperforms two previously published baseline models (VISTA3D, SAM-Med3D) and performs comparably to the third (SegVol), surpassing it in final DSC but trailing it in the other three metrics. In the coreset track of the challenge, ENSAM ranks 5th of 10 overall and first among approaches that do not use pretrained weights. Ablation studies confirm that the relative positional encodings and the Muon optimizer each substantially speed up convergence and improve segmentation quality.
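
A single Dice score (DSC) can never exceed 1, so a DSC AUC of 2.404 presumably integrates Dice over several interaction steps rather than reporting one overlap value. The sketch here illustrates that arithmetic under stated assumptions: `dice` computes 2|P∩T| / (|P| + |T|) for flattened binary masks, and `auc_over_clicks` takes the trapezoidal area under a per-click Dice curve with unit spacing between clicks. Both function names, the unit spacing, the number of clicks, and the empty-mask convention are hypothetical choices for illustration, not the challenge's official evaluation code.

```python
from typing import Sequence


def dice(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice similarity coefficient 2|P∩T| / (|P| + |T|) for flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def auc_over_clicks(scores: Sequence[float]) -> float:
    """Trapezoidal area under a per-click score curve.

    Assumption: consecutive clicks are one unit apart on the x-axis, so
    n clicks give a maximum area of n - 1 when every score equals 1.
    """
    return sum((a + b) / 2.0 for a, b in zip(scores, scores[1:]))
```

With five hypothetical per-click Dice values of 0.5, 0.6, 0.6, 0.7, and 0.7, `auc_over_clicks` returns about 2.5 out of a maximum of 4, so under these assumptions an AUC near 2.4 corresponds to Dice scores averaging roughly 0.6 across clicks.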



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LETT-NeXt: A Lightweight RECIST-Guided Model for 3D CT Lesion Segmentation
cs.CV 2026-06 unverdicted novelty 4.0

LETT-NeXt uses RECIST line prompts in a cropped MedNeXt-v2 encoder-decoder to predict 3D lesion masks, reaching a DSC of 73.9 on hidden test data in a CVPR 2026 segmentation competition.