ESICA: A Scalable Framework for Text-Guided 3D Medical Image Segmentation
Pith reviewed 2026-05-08 04:18 UTC · model grok-4.3
The pith
ESICA uses similarity matrices and adapter decoders to reach state-of-the-art accuracy in text-guided 3D medical image segmentation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ESICA is a scalable framework that improves semantic alignment and boundary precision in text-guided 3D segmentation by combining a similarity-matrix mask prediction formulation, an efficient decomposed decoder with adapter modules, and a two-pass refinement strategy, trained via positive-only pretraining followed by balanced fine-tuning. The result is state-of-the-art accuracy across five imaging modalities and a superior efficiency trade-off in its lite variant.
What carries the argument
The similarity-matrix-based mask prediction formulation that directly computes alignment between text embeddings and volumetric features to guide segmentation.
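The core idea can be illustrated with a minimal sketch. Everything below is an assumption for illustration (the paper's exact formulation is not reproduced on this page): a text prompt is embedded as a D-dimensional vector, the volume encoder yields one D-dimensional feature per voxel, and the cosine similarity between the two acts as a per-voxel mask logit.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def similarity_mask(text_emb, voxel_feats, threshold=0.5):
    """voxel_feats: one D-dim feature vector per voxel (flattened volume).

    Returns a boolean mask: voxels whose features align with the text
    embedding above the threshold are marked as inside the region.
    """
    return [cosine(text_emb, f) > threshold for f in voxel_feats]

text = [1.0, 0.0, 0.0]
feats = [[0.9, 0.1, 0.0],   # aligned with the prompt -> inside the mask
         [0.0, 1.0, 0.0],   # orthogonal -> outside
         [-1.0, 0.0, 0.0]]  # opposite -> outside
print(similarity_mask(text, feats))  # [True, False, False]
```

In the actual framework the similarity would be computed over a dense feature volume and followed by the decoder, but the alignment-as-mask-logit principle is the same.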
If this is right
- Natural language prompts become a practical way to specify regions of interest without predefined label sets.
- Segmentation systems can run on devices with limited memory while retaining high accuracy.
- Boundary refinement reduces ambiguous regions that often appear in clinical 3D scans.
- Two-stage training improves stability when moving from synthetic or limited data to real balanced datasets.
- The framework supports deployment across multiple scanner types in a single model.
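The boundary-refinement point in the list above admits a simple hedged sketch of what a two-pass strategy could look like (the paper's actual mechanism is not detailed in this review): pass one keeps only confident per-voxel decisions, and pass two revisits the uncertain band, resolving each voxel by the majority of its confident neighbours.

```python
def two_pass_refine(probs, lo=0.4, hi=0.6):
    """probs: 1-D list of mask probabilities along one scan line.

    Illustrative only; real volumes would use a 3-D neighbourhood.
    """
    # Pass 1: confident decisions; None marks the uncertain band (lo, hi).
    mask = [p >= hi if (p >= hi or p <= lo) else None for p in probs]
    # Pass 2: resolve uncertain voxels from confident neighbours.
    for i, m in enumerate(mask):
        if m is None:
            neigh = [mask[j] for j in (i - 1, i + 1)
                     if 0 <= j < len(mask) and mask[j] is not None]
            mask[i] = (sum(neigh) > len(neigh) / 2) if neigh else False
    return mask

# The 0.55 voxel is ambiguous on its own but sits between two confident
# foreground voxels, so the second pass pulls it into the mask.
print(two_pass_refine([0.9, 0.8, 0.55, 0.9, 0.1]))  # [True, True, True, True, False]
```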
Where Pith is reading between the lines
- The same similarity-matrix alignment idea could be tested on text-guided tasks outside segmentation, such as report generation from 3D scans.
- Adapter modules in the decoder might transfer to other volumetric networks to reduce parameter counts without retraining from scratch.
- If the two-pass refinement proves robust, it could be added to existing prompt-based models to improve edge cases in noisy medical data.
Load-bearing premise
The three innovations plus the two-stage training produce genuine semantic alignment and boundary gains rather than improvements tied only to the tested modalities and prompt styles.
What would settle it
A new evaluation set of 3D medical volumes from unseen modalities or with different text prompt phrasing where ESICA and its lite variant no longer exceed prior methods in accuracy or efficiency.
Original abstract
Text-guided 3D medical image segmentation offers a flexible alternative to class-based and spatial-prompt-based models by allowing users to specify regions of interest directly in natural language. This paradigm avoids reliance on predefined label sets, reduces ambiguous outputs, and aligns more naturally with clinical workflows. However, existing text-guided frameworks are often computationally expensive, exhibit weak text-volume feature alignment, and fail to capture fine anatomical details. We propose ESICA, a lightweight and scalable framework that addresses these challenges through three innovations: (1) a similarity-matrix-based mask prediction formulation that enhances semantic alignment, (2) an efficient decomposed decoder with adapter modules for accurate volumetric decoding, and (3) a two-pass refinement strategy that sharpens boundaries and resolves uncertain regions. To improve training stability and generalization, ESICA adopts a two-stage scheme consisting of positive-only pretraining followed by balanced fine-tuning. On the CVPR BiomedSegFM benchmark spanning five imaging modalities (CT, MRI, PET, ultrasound, and microscopy), ESICA achieves state-of-the-art segmentation accuracy, while the compact ESICA4 Lite variant attains similar segmentation performance with substantially fewer parameters, yielding a superior efficiency-accuracy trade-off. Our framework advances text-guided segmentation toward efficient, scalable, and clinically deployable systems. Code will be made publicly available at https://github.com/mirthAI/ESICA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents ESICA, a lightweight and scalable framework for text-guided 3D medical image segmentation. It proposes three innovations—a similarity-matrix-based mask prediction formulation to improve semantic alignment, an efficient decomposed decoder with adapter modules for volumetric decoding, and a two-pass refinement strategy to sharpen boundaries—along with a two-stage training scheme (positive-only pretraining followed by balanced fine-tuning). The central empirical claim is state-of-the-art segmentation accuracy on the CVPR BiomedSegFM benchmark across five modalities (CT, MRI, PET, ultrasound, microscopy), with the compact ESICA4 Lite variant achieving comparable performance using substantially fewer parameters and a superior efficiency-accuracy trade-off.
Significance. If the benchmark results hold under scrutiny, the work is significant for advancing text-guided 3D segmentation toward efficient, clinically deployable systems by addressing computational cost and feature alignment limitations in prior frameworks. The multi-modality evaluation and explicit efficiency focus for the Lite variant add practical value, and the planned public code release supports reproducibility.
Major comments (2)
- [§5] §5 (Results on CVPR BiomedSegFM benchmark): The SOTA accuracy claim is load-bearing for the contribution, yet the reported Dice/IoU scores lack error bars, standard deviations across runs, or statistical significance tests against baselines; this omission prevents verification that the gains from the similarity-matrix, decomposed decoder, and two-pass refinement are robust rather than benchmark-specific.
- [§4.2] §4.2 (Ablation studies): The ablation tables do not report the impact of each component (similarity matrix, adapters, two-pass refinement) with consistent metrics across all five modalities or with data-exclusion rules, undermining the claim that the three innovations produce genuine semantic and boundary gains.
Minor comments (3)
- [Abstract] Abstract: The description of the ESICA4 Lite variant states 'similar segmentation performance with substantially fewer parameters' without quantifying the parameter reduction or the exact accuracy delta; adding these numbers would improve precision.
- [§3.1] §3.1: The similarity-matrix formulation is introduced without an explicit equation showing how the matrix is computed from text and volume features, which would clarify the claimed enhancement in semantic alignment.
- [§2] References: Several recent text-guided segmentation works (post-2023) are missing from the related-work section, which would better situate the novelty of the decomposed decoder and two-stage scheme.
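The §3.1 comment asks for an explicit equation. Since the paper's formulation is not reproduced on this page, any concrete equation is an assumption; one common form such a similarity matrix could take is a scaled, projected dot product between the text embedding t and each voxel feature v_i:

```latex
S_i = \frac{t^{\top} W \, v_i}{\sqrt{d}}, \qquad \hat{m}_i = \sigma(S_i)
```

where W is a learned projection, d the feature dimension, and sigma a sigmoid mapping per-voxel similarity scores to mask probabilities. An equation of this shape in §3.1 would make the claimed alignment mechanism checkable.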
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the statistical rigor and completeness of our empirical claims without altering the core contributions.
Point-by-point responses
-
Referee: [§5] §5 (Results on CVPR BiomedSegFM benchmark): The SOTA accuracy claim is load-bearing for the contribution, yet the reported Dice/IoU scores lack error bars, standard deviations across runs, or statistical significance tests against baselines; this omission prevents verification that the gains from the similarity-matrix, decomposed decoder, and two-pass refinement are robust rather than benchmark-specific.
Authors: We agree that the absence of error bars and statistical tests limits the ability to assess robustness. In the revised manuscript we will re-run all experiments with at least five random seeds, report mean Dice and IoU scores together with standard deviations for every modality, and add paired statistical significance tests (Wilcoxon signed-rank) against the strongest baselines. These additions will be placed in §5 and the supplementary material. revision: yes
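The promised paired test over per-seed scores is straightforward to run; the Wilcoxon signed-rank test the authors name lives in SciPy, so as a dependency-free stand-in that illustrates the same paired logic, here is an exact sign-flip permutation test over synthetic per-seed Dice scores (all numbers are placeholders, not results from the paper):

```python
from itertools import product
from statistics import mean

# Synthetic per-seed Dice scores, paired by random seed (five seeds).
esica    = [0.912, 0.905, 0.918, 0.909, 0.914]
baseline = [0.898, 0.894, 0.901, 0.896, 0.899]
diffs = [a - b for a, b in zip(esica, baseline)]
obs = mean(diffs)

# Exact sign-flip permutation test: under H0 each paired difference is
# equally likely to carry either sign, so enumerate all 2^n sign patterns
# and count those at least as extreme as the observed mean difference.
count = total = 0
for signs in product([1, -1], repeat=len(diffs)):
    total += 1
    if abs(mean(s * d for s, d in zip(signs, diffs))) >= abs(obs) - 1e-12:
        count += 1
p = count / total  # two-sided exact p-value
print(p)  # 0.0625
```

With only five seeds the smallest attainable two-sided p-value is 2/32 = 0.0625, which is itself an argument for the "at least five seeds" commitment being a floor rather than a target.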
-
Referee: [§4.2] §4.2 (Ablation studies): The ablation tables do not report the impact of each component (similarity matrix, adapters, two-pass refinement) with consistent metrics across all five modalities or with data-exclusion rules, undermining the claim that the three innovations produce genuine semantic and boundary gains.
Authors: We acknowledge that the current ablation tables are not uniformly reported across all modalities. In the revision we will expand §4.2 to include per-component and cumulative ablations for CT, MRI, PET, ultrasound, and microscopy using identical Dice/IoU metrics. We will also explicitly state the data-exclusion protocol (e.g., which slices or volumes were held out) and add a dedicated table summarizing the incremental contribution of each innovation on every modality. revision: yes
Circularity Check
No significant circularity
Full rationale
The paper proposes architectural components (similarity-matrix mask prediction, decomposed decoder with adapters, two-pass refinement) and a two-stage training scheme, then reports empirical segmentation accuracy on the external CVPR BiomedSegFM benchmark. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that would make any claimed result equivalent to its inputs by construction. The SOTA claim is a direct benchmark measurement rather than a tautology.