Does Your ViT Still Need U-Net for Segmentation?

Hao Wang; Oana M. Dumitrascu; Wenhui Zhu; Xin Li; Xiwen Chen; Xuanzhao Dong; Yalin Wang; Yanxi Chen; Yujian Xiong

arxiv: 2607.00223 · v1 · pith:WM7PXOQBnew · submitted 2026-06-30 · 💻 cs.CV

Pith reviewed 2026-07-02 19:13 UTC · model grok-4.3

keywords medical image segmentation · vision transformer · encoder-only architecture · U-Net · query modeling · dense prediction · pretrained ViT

The pith

Modern ViT backbones make U-Net-style decoders unnecessary for medical image segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper questions whether Vision Transformers after large-scale pretraining still need the traditional U-Net encoder-decoder structure for medical image segmentation. It shows that an encoder-only approach can achieve strong results by using multi-level query modeling and learnable block fusion. This matters because it challenges a long-dominant design choice and points toward simpler architectures. The EoSeg framework tests this idea across seven datasets covering CT, MRI, histopathology, endoscopy, and dermoscopy. A sympathetic reader would see the work as evidence that decoder stages can be dropped when the backbone is strong enough.

Core claim

The paper claims that a U-Net-style decoder is no longer necessary for medical image segmentation with modern ViT backbones. It presents EoSeg as an effective encoder-only design realized through multi-level query modeling and learnable block fusion, with experiments on seven benchmark datasets confirming competitive performance across multiple medical imaging modalities.

What carries the argument

EoSeg, the encoder-only segmentation framework that performs dense prediction directly from a ViT backbone using multi-level query modeling and learnable block fusion.
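The abstract names learnable block fusion but gives no formula for it. One common way to realize such a fusion is a softmax-weighted sum over the outputs of several ViT blocks, with one learnable scalar per block. The sketch below shows that idea in plain Python under that assumption; `fuse_blocks` and `fusion_logits` are hypothetical names, and the paper's actual fusion rule may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_blocks(block_features, fusion_logits):
    """Softmax-weighted sum of per-block token features.

    block_features: list of B blocks, each a list of N tokens,
                    each token a list of D floats.
    fusion_logits:  list of B scalars (hypothetical learnable parameters).
    """
    weights = softmax(fusion_logits)
    num_tokens = len(block_features[0])
    dim = len(block_features[0][0])
    fused = [[0.0] * dim for _ in range(num_tokens)]
    for w, block in zip(weights, block_features):
        for t, token in enumerate(block):
            for d, value in enumerate(token):
                fused[t][d] += w * value
    return fused

# Toy example: 3 blocks, 2 tokens, 2 channels (made-up values).
blocks = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
    [[3.0, 0.0], [0.0, 3.0]],
]
# Equal logits give equal weights, so the result is the plain average.
fused = fuse_blocks(blocks, fusion_logits=[0.0, 0.0, 0.0])
```

In a trained model the logits would be updated by gradient descent, letting the network decide how much each depth contributes to the dense prediction.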

If this is right

Medical segmentation models can be built from the ViT encoder alone without a separate decoder.
Multi-level query modeling enables the encoder to produce accurate pixel-level outputs directly.
Learnable fusion of ViT blocks can integrate hierarchical features for segmentation tasks.
The encoder-only approach generalizes across CT, MRI, histopathology, endoscopy, and dermoscopy.
Pretrained ViTs reduce the architectural need for U-Net-style decoders in segmentation.
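To make the query-based bullet concrete: in generic query-based segmentation, each query vector is compared with every patch token, and the resulting scores are reshaped into a mask on the patch grid. The sketch below illustrates that mechanism only. It is an assumption about the general technique, not the EoSeg implementation, and `query_masks` with its arguments is a hypothetical name.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def query_masks(queries, patch_tokens, grid_h, grid_w):
    """One soft mask per query from query/patch dot products.

    queries:      list of Q vectors of length D (hypothetical learnable queries)
    patch_tokens: list of grid_h * grid_w vectors of length D
    returns:      Q masks, each grid_h rows by grid_w columns of probabilities
    """
    if len(patch_tokens) != grid_h * grid_w:
        raise ValueError("token count must equal grid_h * grid_w")
    masks = []
    for q in queries:
        probs = [sigmoid(sum(a * b for a, b in zip(q, tok)))
                 for tok in patch_tokens]
        # Reshape the flat list of probabilities into grid rows.
        masks.append([probs[r * grid_w:(r + 1) * grid_w]
                      for r in range(grid_h)])
    return masks

# Toy example: 2 queries, a 2x2 patch grid, 2-dimensional tokens (made-up values).
queries = [[4.0, 0.0], [0.0, 4.0]]
tokens = [[1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, 1.0]]
masks = query_masks(queries, tokens, grid_h=2, grid_w=2)
# Query 0 scores high on the top row of patches, query 1 on the bottom row.
```

A full model would still need to upsample these patch-level masks to pixel resolution and assign a class to each query; both steps are omitted here.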

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This could lead to lighter models that run faster in clinical or real-time settings.
The query-based mechanism might extend to other dense prediction tasks such as detection in medical scans.
Stronger future pretraining could make encoder-only designs the default choice in vision.
Similar simplifications might apply to segmentation outside medicine once backbones are sufficiently capable.

Load-bearing premise

Large-scale pretraining has advanced ViT representations enough to support accurate dense prediction without any decoder stage.

What would settle it

A head-to-head test on the same ViT backbone showing that attaching a standard U-Net decoder produces consistently higher accuracy across several medical datasets would falsify the claim that the decoder is unnecessary.
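The head-to-head test described above amounts to a paired comparison on shared datasets. A minimal sketch of how such results might be tabulated follows; the function name, the `margin` tolerance, and every number in the example are hypothetical placeholders, not results from the paper.

```python
def compare_matched_backbone(encoder_only, with_decoder, margin=0.0):
    """Summarize a paired comparison on a shared backbone.

    Both arguments map dataset name -> mDice in percent. `margin` is a
    hypothetical tolerance below which a difference counts as a tie.
    """
    shared = sorted(set(encoder_only) & set(with_decoder))
    if not shared:
        raise ValueError("no datasets in common")
    # Positive difference means the decoder variant scored higher.
    diffs = {name: with_decoder[name] - encoder_only[name] for name in shared}
    return {
        "decoder_wins": [n for n, d in diffs.items() if d > margin],
        "encoder_only_wins": [n for n, d in diffs.items() if d < -margin],
        "mean_diff": sum(diffs.values()) / len(diffs),
    }

# Placeholder scores for illustration only.
enc = {"dataset_A": 80.0, "dataset_B": 90.0, "dataset_C": 70.0}
dec = {"dataset_A": 81.0, "dataset_B": 89.5, "dataset_C": 70.0}
summary = compare_matched_backbone(enc, dec, margin=0.25)
```

The claim would be in trouble only if the decoder variant won consistently across datasets, which is why the summary reports per-dataset wins alongside the mean difference.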

Figures

Figures reproduced from arXiv: 2607.00223 by Hao Wang, Oana M. Dumitrascu, Wenhui Zhu, Xin Li, Xiwen Chen, Xuanzhao Dong, Yalin Wang, Yanxi Chen, Yujian Xiong.

**Figure 2.** Roadmap of EoSeg. Starting from pretrained Vi...

**Figure 3.** Overview of EoSeg. (a) A pretrained DINOv2 backbone extracts visual representations from the input image. (b) Learnable...

**Figure 4.** Segmentation visualization comparison on the Synapse, GlaS, and MoNuSeg datasets. From top to bottom: multi-organ seg...

Original abstract

Medical image segmentation is dominated by U-Net-style encoder-decoder architectures. Vision Transformers (ViTs) overcome the limited receptive field of convolutional networks through self-attention, enabling modeling of long-range dependencies. Early ViT-based segmentation methods typically retained U-Net-style decoders because pretrained ViT representations were insufficient to support accurate dense prediction. Recent advances in large-scale pretraining have redefined the representation capability of ViTs, reducing the reliance on U-Net-style decoder architectures in modern vision models. This prompts two questions: Is the U-Net paradigm still necessary for medical image segmentation? If not, how should an encoder-only segmentation framework be designed? Motivated by these questions, we explore key architectural choices for encoder-only medical image segmentation based on modern ViT backbones and establish a query-based encoder-only design with multi-level query modeling and learnable block fusion, realized in Encoder-only Segmentation (EoSeg). Extensive experiments across seven benchmark datasets spanning CT, MRI, histopathology, endoscopy, and dermoscopy validate the effectiveness of the proposed design across diverse medical imaging modalities, including mDice scores of 85.50% on Synapse, 91.73% on ACDC, and 93.27% on GlaS. The results demonstrate that a U-Net-style decoder is no longer necessary for medical image segmentation with modern ViT backbones and further show that EoSeg provides an effective encoder-only design. Code is available at: https://github.com/Retinal-Research/EoSeg
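The headline numbers in the abstract are mDice scores. As a reminder of what that metric measures, the sketch below computes the Dice coefficient per foreground class, 2|P ∩ G| / (|P| + |G|), and averages the results. It assumes label 0 is background and skips classes absent from both prediction and ground truth; evaluation protocols differ on both points, and the paper's exact protocol is not stated in the abstract.

```python
def dice_per_class(pred, target, num_classes):
    """Dice for each foreground class (labels 1..num_classes-1) on flat label lists."""
    scores = {}
    for c in range(1, num_classes):
        p = [x == c for x in pred]
        g = [x == c for x in target]
        inter = sum(a and b for a, b in zip(p, g))
        denom = sum(p) + sum(g)
        if denom == 0:
            continue  # class absent from both; skipped here (conventions differ)
        scores[c] = 2.0 * inter / denom
    return scores

def mean_dice(pred, target, num_classes):
    """Average of the per-class Dice scores, or NaN if no class is present."""
    scores = dice_per_class(pred, target, num_classes)
    return sum(scores.values()) / len(scores) if scores else float("nan")

# Toy flattened label maps with classes 0 (background), 1, and 2.
pred   = [0, 1, 1, 2, 2, 2, 0, 0]
target = [0, 1, 1, 1, 2, 2, 0, 0]
print(mean_dice(pred, target, num_classes=3))  # prints 0.8
```

In the toy example each class has an overlap of 2 pixels against a combined size of 5, so both per-class scores are 4/5.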

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EoSeg gets competitive mDice numbers on seven medical datasets with an encoder-only ViT, but the abstract gives no sign of a controlled comparison against the same backbone plus a standard U-Net decoder.

The main takeaway is that EoSeg posts solid mDice scores (85.50% on Synapse, 91.73% on ACDC, 93.27% on GlaS) across CT, MRI, histopathology, and other modalities using only a modern ViT encoder with multi-level query modeling and learnable block fusion. That design choice is the concrete new element.

The paper does a reasonable job running the same architecture on seven benchmarks and making the code public. The question it poses, whether large-scale pretraining has made U-Net decoders unnecessary, is timely for the medical segmentation community.

The soft spot is exactly the one flagged in the stress-test note. The central claim that a U-Net-style decoder is no longer necessary rests on showing that performance is not just coming from the pretrained backbone. The abstract does not describe a head-to-head run with the identical ViT encoder paired with a conventional decoder, nor any ablations on the query or fusion modules. Without those, the gains could be explained by backbone improvements alone. No error bars or variance numbers are mentioned either.

This is aimed at people building or benchmarking transformer segmentation models in medical imaging. A reader already working on query-based or decoder-free designs could pick up the specific fusion trick, but anyone wanting to cite the broader conclusion would need the missing baseline first.

I would send it to review if the authors add the controlled comparison and a couple of ablations; the current version leaves the necessity argument under-supported.

Referee Report

2 major / 2 minor

Summary. The paper questions whether U-Net-style encoder-decoder architectures remain necessary for medical image segmentation given recent advances in large-scale pretraining of Vision Transformers. It proposes EoSeg, a query-based encoder-only segmentation framework incorporating multi-level query modeling and learnable block fusion, and reports competitive mDice scores across seven datasets (e.g., 85.50% on Synapse, 91.73% on ACDC, 93.27% on GlaS) to argue that modern ViT backbones suffice without a U-Net decoder.

Significance. If the central claim is supported by controlled experiments, the work could meaningfully shift design practices in medical segmentation toward simpler encoder-only models, reducing architectural complexity while leveraging pretrained ViT representations. The provision of code further aids reproducibility.

major comments (2)

[Abstract / Experiments] The central claim that 'a U-Net-style decoder is no longer necessary' (Abstract) is load-bearing but unsupported without a controlled ablation: the same modern pretrained ViT backbone must be paired with a conventional U-Net decoder and compared directly to EoSeg on the reported datasets. No such baseline is described, so gains cannot be attributed to the encoder-only design rather than backbone/pretraining improvements alone.
[Experiments] §4 (or equivalent results section): the reported mDice scores lack error bars, statistical significance tests, or multiple runs, and do not include direct comparisons to recent ViT-based encoder-decoder baselines using identical backbones; this weakens the cross-dataset validation of the necessity claim.

minor comments (2)

[Methods] Notation for 'learnable block fusion' and 'multi-level query modeling' should be formalized with equations or pseudocode for clarity.
[Abstract] The abstract mentions seven datasets but details only three; a summary table of all results would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects for strengthening the experimental validation of our central claim regarding encoder-only segmentation with modern ViT backbones. We respond point-by-point below and commit to revisions that directly address the concerns raised.

Referee: [Abstract / Experiments] The central claim that 'a U-Net-style decoder is no longer necessary' (Abstract) is load-bearing but unsupported without a controlled ablation: the same modern pretrained ViT backbone must be paired with a conventional U-Net decoder and compared directly to EoSeg on the reported datasets. No such baseline is described, so gains cannot be attributed to the encoder-only design rather than backbone/pretraining improvements alone.

Authors: We agree that a controlled ablation pairing the identical modern pretrained ViT backbone with a conventional U-Net-style decoder would provide the most direct evidence for attributing performance to the encoder-only design. While the manuscript includes comparisons to multiple ViT-based methods that incorporate decoder components, these do not constitute an exact matched-backbone control. In the revised manuscript, we will add this specific ablation experiment on the Synapse and ACDC datasets (and report results on additional datasets if space permits) to strengthen support for the claim.

revision: yes

Referee: [Experiments] §4 (or equivalent results section): the reported mDice scores lack error bars, statistical significance tests, or multiple runs, and do not include direct comparisons to recent ViT-based encoder-decoder baselines using identical backbones; this weakens the cross-dataset validation of the necessity claim.

Authors: We acknowledge that including error bars from multiple runs, along with statistical significance testing, would improve the rigor of the reported results. We will perform additional runs with different random seeds for the main experiments and incorporate error bars plus appropriate statistical tests (such as Wilcoxon signed-rank tests) in the revised results section. We will also expand the baseline comparisons to explicitly identify and include any recent ViT-based encoder-decoder methods that share the same backbone and pretraining setup as EoSeg.

revision: yes
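The error bars and Wilcoxon signed-rank test mentioned in this response can be sketched with the Python standard library alone. The code below reports mean and sample standard deviation across seeds and computes an exact two-sided signed-rank p-value by enumerating sign patterns, which is only practical for a small number of pairs. All function names are hypothetical and the inputs are toy values; in practice a library routine such as `scipy.stats.wilcoxon` would normally be used instead.

```python
import itertools
import statistics

def mean_std(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)

def signed_ranks(x, y):
    """Signed ranks of paired differences; zeros dropped, ties get average ranks."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of equal absolute differences.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return [r if d > 0 else -r for r, d in zip(ranks, diffs)]

def wilcoxon_exact(x, y):
    """Return (W, p) for a two-sided exact signed-rank test; small n only."""
    sr = signed_ranks(x, y)
    abs_ranks = [abs(r) for r in sr]
    w_plus = sum(r for r in sr if r > 0)
    w = min(w_plus, sum(abs_ranks) - w_plus)
    total = sum(abs_ranks)
    hits = 0
    # Under the null, every sign pattern is equally likely.
    for signs in itertools.product((0, 1), repeat=len(abs_ranks)):
        s = sum(r for r, keep in zip(abs_ranks, signs) if keep)
        if min(s, total - s) <= w:
            hits += 1
    return w, hits / 2 ** len(abs_ranks)

# Toy paired values (not results): method x beats method y on all six pairs.
w, p = wilcoxon_exact([1, 2, 3, 4, 5, 6], [0, 0, 0, 0, 0, 0])
print(w, p)  # prints 0 0.03125
```

With six pairs the smallest attainable two-sided p-value is 2/64, so a comparison over only a handful of datasets or seeds has limited power however large the differences are.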

Circularity Check

0 steps flagged

No circularity; empirical benchmarks are independent of the architectural claim

The paper advances an empirical claim that modern pretrained ViT backbones render U-Net-style decoders unnecessary, supported by mDice scores on seven external medical imaging benchmarks (Synapse, ACDC, GlaS, etc.). No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the abstract or described methodology. The EoSeg design (multi-level query modeling, learnable block fusion) is introduced as an ansatz and then validated against held-out test sets rather than reducing to its own inputs by construction. The derivation chain is therefore self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no details on free parameters, axioms, or invented entities; assessment limited to surface claims about pretraining sufficiency.

pith-pipeline@v0.9.1-grok · 5829 in / 945 out tokens · 24195 ms · 2026-07-02T19:13:00.828881+00:00 · methodology


[1] [1]

Faysal Ahamed, Md

Md. Faysal Ahamed, Md. Khalid Syfullah, Ovi Sarkar, Md. Tohidul Islam, Md. Nahiduzzaman, Md. Rabiul Islam, Amith Abdullah Khandakar, Mohamed Arselene Ayari, and Muhammad Enamul Hoque Chowdhury. Irv2-net: A deep learning framework for enhanced polyp segmentation per- formance integrating inceptionresnetv2 and unet architec- ture with test time augmentation...

2023

[2] [2]

Swin-unet: Unet-like pure transformer for medical image segmentation

Hu Cao, Yueyue Wang, Jieneng Chen, Dongsheng Jiang, Xi- aopeng Zhang, Qi Tian, and Manning Wang. Swin-unet: Unet-like pure transformer for medical image segmentation. InECCV Workshops, 2021. 1, 2, 3

2021

[3] [3]

TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation

Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan Loddon Yuille, and Yuyin Zhou. Transunet: Transformers make strong encoders for medical image segmentation.ArXiv, abs/2102.04306, 2021. 2, 3, 6

work page internal anchor Pith review Pith/arXiv arXiv 2021

[4] [4]

Encoder-decoder with atrous separable convolution for semantic image segmentation

Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision, 2018. 6

2018

[5] [5]

Schwing, Alexan- der Kirillov, and Rohit Girdhar

Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexan- der Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation.2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1280–1289, 2021. 5

2022

[6] [6]

Automated cardiac diagnosis challenge (acdc).https://www.creatis.insa- lyon.fr/ Challenge/acdc/, 2017

CREATIS. Automated cardiac diagnosis challenge (acdc).https://www.creatis.insa- lyon.fr/ Challenge/acdc/, 2017. Accessed: 2026-06-19. 6

2017

[7] [7]

Ms red: A novel multi-scale residual encoding and decoding network for skin lesion segmentation.Medical image analysis, 75:102293,

Duwei Dai, Caixia Dong, Songhua Xu, Qingsen Yan, Zong- fang Li, Chunyan Zhang, and Nana Luo. Ms red: A novel multi-scale residual encoding and decoding network for skin lesion segmentation.Medical image analysis, 75:102293,

[8] [8]

Li, and Li Fei-Fei

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, K. Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database.2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. 3

2009

[9] [9]

Fac-net: Feedback attention network based on context en- coder network for skin lesion segmentation.Sensors (Basel, Switzerland), 21, 2021

Yuying Dong, Liejun Wang, Shuli Cheng, and Yongming Li. Fac-net: Feedback attention network based on context en- coder network for skin lesion segmentation.Sensors (Basel, Switzerland), 21, 2021. 2, 3

2021

[10] [10]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale.ArXiv, abs/2010.11929, 2020. 1, 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2010

[11] [11]

Fu, Kang Zhou, Huaying Hao, Yitian Zhao, Tianyang Zhang, Shenghua Gao, and Jiang Liu

Zaiwang Gu, Jun Cheng, H. Fu, Kang Zhou, Huaying Hao, Yitian Zhao, Tianyang Zhang, Shenghua Gao, and Jiang Liu. Ce-net: Context encoder network for 2d medical im- age segmentation.IEEE Transactions on Medical Imaging, 38:2281–2292, 2019. 1

2019

[12] [12]

Roth, and Daguang Xu

Ali Hatamizadeh, Dong Yang, Holger R. Roth, and Daguang Xu. Unetr: Transformers for 3d medical image segmenta- tion.2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1748–1758, 2021. 2, 3

2022

[13] [13]

Girshick

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Pi- otr Doll’ar, and Ross B. Girshick. Masked autoencoders are scalable vision learners.2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15979–15988, 2021. 2

2022

[14] [14]

Moein Heidari, Amirhossein Kazerouni, Milad Soltany Kadarvish, Reza Azad, Ehsan Khodapanah Aghdam, Julien Cohen-Adad, and Dorit Merhof. Hiformer: Hierarchical multi-scale representations using transformers for medical image segmentation.2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6191–6201,

2023

[15] International Skin Imaging Collaboration. Isic 2016: Skin lesion analysis towards melanoma detection. https://challenge.isic-archive.com/data/#2016,

2016

[16] International Skin Imaging Collaboration. Isic 2017: Skin lesion analysis towards melanoma detection. https://challenge.isic-archive.com/data/#2017,

2017

[17] Fabian Isensee, Paul F. Jaeger, Simon A. A. Kohl, Jens Petersen, and Klaus Hermann Maier-Hein. nnu-net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18:203–211, 2020.

[18] Debesh Jha, Pia Helen Smedsrud, M. Riegler, P. Halvorsen, Thomas de Lange, Dag Johansen, and Haavard D. Johansen. Kvasir-seg: A segmented polyp dataset. In Conference on Multimedia Modeling, 2019.

[19] Debesh Jha, Pia Helen Smedsrud, M. Riegler, Dag Johansen, Thomas de Lange, P. Halvorsen, and Håvard Dagenborg Johansen. Resunet++: An advanced architecture for medical image segmentation. 2019 IEEE International Symposium on Multimedia (ISM), pages 225–2255, 2019.

[20] Debesh Jha, M. Riegler, Dag Johansen, P. Halvorsen, and Haavard D. Johansen. Doubleu-net: A deep convolutional neural network for medical image segmentation. 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), pages 558–564, 2020.

[21] Debesh Jha, Nikhil Kumar Tomar, Sharib Ali, M. Riegler, Haavard D. Johansen, Dag Johansen, Thomas de Lange, and P. Halvorsen. Nanonet: Real-time polyp segmentation in video capsule endoscopy and colonoscopy. 2021 IEEE 34th International Symposium on Computer-Based Medical Systems (CBMS), pages 37–43, 2021.

[22] Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, and Daan de Geus. Your vit is secretly an image segmentation model. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 25303–25313, 2025.

[23] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloé Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross B. Girshick. Segment anything. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3992–4003, 2023.

[24] Neeraj Kumar, Ruchika Verma, Sanuj Sharma, S. K. Bhargava, Abhishek Vahadane, and Amit Sethi. A dataset and a technique for generalized nuclear segmentation for computational pathology. IEEE Transactions on Medical Imaging, 36:1550–1560, 2017.

[25] Xin Li, Wenhui Zhu, Xuanzhao Dong, Oana M. Dumitrascu, and Yalin Wang. Evit-unet: U-net like efficient vision transformer for medical image segmentation on mobile and edge devices. Proceedings. IEEE International Symposium on Biomedical Imaging, 2025, 2024.

[26] Xian Lin, Li Yu, Kwang-Ting Cheng, and Zengqiang Yan. Batformer: Towards boundary-aware lightweight transformer for efficient medical image segmentation. IEEE Journal of Biomedical and Health Informatics, 27:3501–3512,

[27] Haiyan Liu, Yu Zeng, Hao Li, Fuxing Wang, Jianjun Chang, Huaping Guo, and Jian Zhang. Ddanet: A deep dilated attention network for intracerebral haemorrhage segmentation. IET Systems Biology, 18:285–297, 2024.

[28] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9992–10002, 2021.

[29] Anam Memon and Ali Asghar Manjotho. Apformer: Anti-phishing transformer for website-phishing detection via joint feature learning. 2024 International Conference on Engineering & Computing Technologies (ICECT), pages 1–5,

2024

[30] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russ Howes, Po-Yao (Bernie) Huang, Shang-Wen Li, Ishan Misra, Michael G. Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julie...

arXiv 2023

[31] Namuk Park and Songkuk Kim. How do vision transformers work? ArXiv, abs/2202.06709, 2022.

[32] William S. Peebles and Saining Xie. Scalable diffusion models with transformers. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4172–4182, 2022.

[33] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. ArXiv, abs/1505.04597, 2015.

[34] Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional networks for semantic segmentation. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2014.

[35] Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025.

[36] Korsuk Sirinukunwattana, Josien P. W. Pluim, Hao Chen, Xiaojuan Qi, Pheng-Ann Heng, Yun Bo Guo, Li Yang Wang, Bogdan J. Matuszewski, Elia Bruni, Urko Sanchez, Anton Böhm, Olaf Ronneberger, Bassem Ben Cheikh, Daniel Racoceanu, Philipp Kainz, Michael Pfeiffer, Martin Urschler, David R. J. Snead, and Nasir M. Rajpoot. Gland segmentation in colon histo...

2016

[37] Synapse. Multi-atlas labeling beyond the cranial vault – workshop and challenge. https://www.synapse.org/Synapse:syn3193805/wiki/217789, 2015. Accessed: 2026-06-19.

[38] Jeya Maria Jose Valanarasu and Vishal M. Patel. Unext: Mlp-based rapid medical image segmentation network. ArXiv, abs/2203.04967, 2022.

[39] Di Wang, Jing Zhang, Minqiang Xu, Lin Liu, Dongsheng Wang, Erzhong Gao, Chengxi Han, Haonan Guo, Bo Du, Dacheng Tao, and L. Zhang. Mtp: Advancing remote sensing foundation model via multitask pretraining. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 17:11632–11654, 2024.

[40] Zhonghua Wang, Junhao Lyu, and Xiaoying Tang. Set: Superpixel embedded transformer for skin lesion segmentation. Medical Image Analysis, 105:103738, 2025.

[41] Guoping Xu, Xingrong Wu, Xuan Zhang, and Xinwei He. Levit-unet: Make faster encoders with transformer for medical image segmentation. ArXiv, abs/2107.08623, 2021.

[42] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 11941–11952, 2023.

[43] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. Unet++: A nested u-net architecture for medical image segmentation. Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, held in conjunction with MI...

2018

[44] Wenhui Zhu, Xiwen Chen, Peijie Qiu, Mohammad Farazi, Aristeidis Sotiras, Abolfazl Razi, and Yalin Wang. Selfreg-unet: Self-regularized unet for medical image segmentation. Medical image computing and computer-assisted intervention: MICCAI ... International Conference on Medical Image Computing and Computer-Assisted Intervention, 15008:601–611, 20...

2024