
arxiv: 2503.01835 · v2 · submitted 2025-03-03 · 💻 cs.CV

Primus: Enforcing Attention Usage for 3D Medical Image Segmentation

Pith reviewed 2026-05-23 01:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D medical image segmentation · Transformer architecture · attention mechanisms · convolutional neural networks · semantic segmentation · medical imaging

The pith

Transformer-centric models that keep convolution to a minimum now match top CNNs on 3D medical image segmentation benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that most existing Transformer-based segmentation models still depend heavily on convolutional blocks, so that removing the attention parts barely hurts performance. It then introduces two new architectures, Primus and PrimusV2, that are built almost entirely from Transformer blocks: they keep high-resolution tokens, use improved positional embeddings, and (in PrimusV2) add an iterative patch-embedding scheme. If these designs hold up, they show that attention can be made effective in 3D medical volumes once the model is forced to rely on it rather than falling back on local convolution. A reader would care because medical imaging has long been dominated by CNNs; a competitive Transformer-centric route would change which scaling laws and pre-training strategies become viable.

Core claim

Current Transformer segmentation models are limited because they over-rely on convolutional blocks; in some architectures, performance is unaffected when the Transformer blocks are removed. The authors respond with two Transformer-centric designs. Primus combines high-resolution tokens with advances in positional embeddings and block design; it surpasses prior Transformer-based methods and competes with a default nnU-Net. PrimusV2 adds iterative patch embedding; it exceeds the default nnU-Net and is on par with state-of-the-art CNNs such as ResEnc-L and MedNeXt across nine public datasets. The authors present this as the first competitive Transformer-centric model for 3D medical image segmentation.
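
The removal test behind the first half of this claim can be sketched in a few lines. This is an illustration only: the function names, the relative-drop reading of δTR, and the Dice values are assumptions made here, and the 3% threshold is borrowed from the Figure 1 caption rather than from a stated formula.

```python
def relative_drop(dice_full: float, dice_ablated: float) -> float:
    """Fraction of the full model's Dice lost when Transformer blocks become identity maps."""
    if dice_full <= 0:
        raise ValueError("dice_full must be positive")
    return (dice_full - dice_ablated) / dice_full


def relies_on_transformer(dice_full: float, dice_ablated: float, threshold: float = 0.03) -> bool:
    """True if removing the Transformer costs at least `threshold` of the full score."""
    return relative_drop(dice_full, dice_ablated) >= threshold


# Hypothetical Dice scores, not results from the paper.
hybrid = relies_on_transformer(dice_full=0.80, dice_ablated=0.79)
attention_led = relies_on_transformer(dice_full=0.80, dice_ablated=0.60)
print(hybrid, attention_led)  # False True
```

A model whose score barely moves under this test is doing most of its work in the layers that were left untouched, which is the pattern the paper reports for several hybrids.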

What carries the argument

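Primus and PrimusV2 architectures that enforce attention usage by removing convolutional blocks from the backbone (Figure 3 describes a single strided convolution that remains for tokenization) and relying on high-resolution tokens, refined positional embeddings, and, in PrimusV2, iterative patch embedding.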
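
Figure 3 describes the tokenizer as a single convolution with kernel size and stride k×k×k for small k. A minimal sketch of what "high-resolution tokens" means for sequence length follows, assuming a hypothetical 128×128×128 input patch and values of k chosen only for illustration; none of these numbers are taken from the paper.

```python
from math import prod


def token_grid(volume_shape, k):
    """Tokens per axis when a k x k x k kernel with stride k tiles the volume."""
    if any(size % k for size in volume_shape):
        raise ValueError("each axis must be divisible by k")
    return tuple(size // k for size in volume_shape)


def num_tokens(volume_shape, k):
    """Sequence length seen by the Transformer."""
    return prod(token_grid(volume_shape, k))


patch = (128, 128, 128)  # hypothetical input patch size
for k in (16, 8):        # illustrative values of k
    n = num_tokens(patch, k)
    print(k, token_grid(patch, k), n, n * n)  # last column: token pairs in full self-attention
# 16 (8, 8, 8) 512 262144
# 8 (16, 16, 16) 4096 16777216
```

Halving k multiplies the token count by 8 and the number of token pairs in full self-attention by 64, which is the cost that comes with finer tokens.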

If this is right

  • Primus already exceeds earlier Transformer hybrids and matches a default nnU-Net.
  • PrimusV2 further surpasses the nnU-Net baseline and reaches parity with leading CNNs across nine datasets.
  • Transformers can now be treated as a viable, state-of-the-art backbone for 3D medical segmentation without hybrid crutches.
  • Future scaling of these models becomes possible because they no longer hide their capacity inside convolutional layers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the same enforcement principle applies to other dense-prediction tasks, pure attention models could replace hybrids in video or 3D reconstruction as well.
  • The result suggests that earlier comparisons between Transformers and CNNs in medical imaging were confounded by incomplete use of attention, so re-evaluations with forced-attention baselines may be needed.
  • Practitioners could now test whether large-scale self-supervised pre-training on unlabeled volumes yields larger gains for Primus-style models than it did for hybrids.

Load-bearing premise

That the measured gains come from forcing the model to use attention rather than from any uncontrolled differences in training schedule, data augmentation, or hyper-parameters between Primus and the baselines it is compared against.

What would settle it

Retrain the strongest prior hybrid Transformer models using exactly the same training schedule, augmentation pipeline, and hyper-parameters as PrimusV2. If they close the gap, the claim that architecture alone explains it would be weakened; if they still lag behind, the claim would be supported.

Figures

Figures reproduced from arXiv: 2503.01835 by Constantin Ulrich, Dasha Trofimova, Fabian Isensee, Gregor Köhler, Klaus Maier-Hein, Michael Baumgartner, Raphael Stock, Saikat Roy, Sebastian Ziegler, Tassilo Wald.

Figure 1
Figure 1: Effective Transformer-based networks have low UNet-index and high performance. In Fig. 1a, we observe that existing architectures mostly do not outperform a similarly trained UNet, on 2 datasets: For TotalSegmentator-BTCV, 8 out of 9 and for KiTS19, all 9. Further, we demonstrate in Fig. 1b on both datasets that 6 out of 9 architectures do not even show a 3% loss of performance (δTR) on completely removing…
Figure 2
Figure 2: Scaling Dataset size does not fix the challenges with Transformer-based representation learning. Increasing training data on TotalSegmentator-BTCV (1000 3D volumes) only seems to increase the gap between Transformer and no Transformer in 4 out of 9 architectures (UNETR, SwinUNETR, SwinUNet, TransFuse). As reference we include a default nnU-Net.
Figure 3
Figure 3: Primus is a Transformer-heavy architecture with limited convolution layers. The architecture extracts high-resolution 3D visual tokens through a single convolution layer with kernel size (k×k×k) and stride (k×k×k) with small k. Once in sequence format, it uses the Eva-02 [17] Transformer architecture, featuring a Rotary Position Embedding (RoPE) adapted to 3D and the Eva-02 MLP Block. The lightweight d…
Figure 4
Figure 4: Segmentation performance pre-and-post Identity replacement of a Transformer module quantifies their importance. By replacing the entire Transformer block, including LayerNorm, Multi-Head Self-Attention or Shifted Window Multi-head Self-Attention, the influence of the entire Transformer within an architecture can be evaluated.
Figure 5
Figure 5: MICCAI challenges categorized by their task. For a long time, at least 50% of challenges have focused only on semantic segmentation.
Figure 6
Figure 6: Medical image segmentation datasets are significantly smaller and sparsely-labeled compared to their natural image counterparts. Our dataset visualization (Left) illustrates this chasm by the Average Percentage of Image/Volume Labeled vs. Number of Samples of datasets from both domains. Radii visualizes pixel/voxels over the whole dataset. However, the original evaluation of our 9 Transformer-based models …
Figure 7
Figure 7: Impact of Transformer blocks on learned representations across different architectures. We measure the representational similarity using centered kernel alignment (CKA) between multiple training runs of the same Transformer architecture (black) and between a Transformer architecture and its variant where Transformer blocks are replaced with identity mappings (blue). The gray-shaded region highlights the ga…
Figure 8
Figure 8: Visualization of which positions we select to extract activations from. We select all representations at positions along the red line,
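
Figure 7 compares representations with centered kernel alignment. The excerpt does not say which kernel is used, so the following is only a sketch of the linear variant described by Kornblith et al. [40], written with plain Python lists; the function names and the toy activations are assumptions made for illustration.

```python
from math import sqrt


def center_columns(matrix):
    """Subtract each feature's mean over the examples (rows)."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - mu for v, mu in zip(row, means)] for row in matrix]


def cross_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for two matrices with the same number of rows."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between activations x (n x p) and y (n x q) of the same n examples."""
    if len(x) != len(y):
        raise ValueError("x and y must describe the same examples")
    xc, yc = center_columns(x), center_columns(y)
    denom = sqrt(cross_sq_norm(xc, xc) * cross_sq_norm(yc, yc))
    if denom == 0:
        raise ValueError("activations have zero variance")
    return cross_sq_norm(yc, xc) / denom


# Toy activations for four examples, purely illustrative.
acts = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
print(linear_cka(acts, acts))  # 1.0
```

In the figure's terms, a CKA between a model and its identity-ablated variant that is as high as the CKA between two seeds of the same model would suggest the Transformer blocks change the learned representation very little.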
Original abstract

Transformers have achieved remarkable success across multiple fields, yet their impact on 3D medical image segmentation remains limited with convolutional networks still dominating major benchmarks. In this work, (A) we analyze current Transformer-based segmentation models and identify critical shortcomings, particularly their over-reliance on convolutional blocks. Further, we demonstrate that in some architectures, performance is unaffected by the absence of the Transformer, thereby demonstrating their limited effectiveness. To address these challenges, we move away from hybrid architectures and (B) introduce Transformer-centric segmentation architectures, termed Primus and PrimusV2. Primus leverages high-resolution tokens, combined with advances in positional embeddings and block design, to maximally leverage its Transformer blocks, while PrimusV2 expands on this through an iterative patch embedding. Through these adaptations, Primus surpasses current Transformer-based methods and competes with a default nnU-Net while PrimusV2 exceeds it and is on par with the state-of-the-art CNNs such as ResEnc-L and MedNeXt architectures across nine public datasets. In doing so, we introduce the first competitive Transformer-centric model, making Transformers state-of-the-art in 3D medical image segmentation. The code is available here: https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/primus.md.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper analyzes limitations in existing Transformer-based 3D medical image segmentation models, particularly their over-reliance on convolutional blocks and cases where performance is unaffected by removing the Transformer component. It introduces two Transformer-centric architectures, Primus and PrimusV2, that use high-resolution tokens, advances in positional embeddings and block design, and (for PrimusV2) iterative patch embedding to enforce attention usage. These are reported to surpass prior Transformer-based methods, compete with or exceed a default nnU-Net, and match SOTA CNNs (ResEnc-L, MedNeXt) across nine public datasets, establishing the first competitive pure-Transformer model and making Transformers state-of-the-art in the domain. Code is released.

Significance. If the performance gains are shown to arise specifically from the architectural mechanisms that enforce attention usage under matched training conditions, the work would be significant: it would provide the first credible demonstration that a pure Transformer can reach or exceed the performance of dominant CNN and hybrid models on standard 3D medical segmentation benchmarks. The public code release is a clear strength that enables direct verification and extension.

major comments (2)
  1. [Abstract] Abstract: the central claim that Primus/PrimusV2 gains are attributable to high-resolution tokens, positional embeddings, block design, and iterative patch embedding (i.e., to enforced attention usage) is load-bearing on the assumption that all compared models were trained under identical schedules, augmentations, optimizers, and hyperparameters; the abstract provides no statement that baselines were re-trained under the authors' protocol, leaving the attribution unestablished.
  2. [Abstract] Abstract / Experiments: no details are supplied on the number of runs, statistical testing (e.g., paired t-tests or Wilcoxon tests with correction), or variance across the nine datasets; without these, the reported ranking (PrimusV2 on par with ResEnc-L/MedNeXt) cannot be assessed for robustness.
minor comments (1)
  1. [Abstract] The GitHub link is given but the main text does not describe the exact repository structure or reproduction instructions, which would aid readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for clearer statements on training protocols and experimental robustness. We will revise the abstract and experiments section to address these points directly. Our responses to the major comments follow.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that Primus/PrimusV2 gains are attributable to high-resolution tokens, positional embeddings, block design, and iterative patch embedding (i.e., to enforced attention usage) is load-bearing on the assumption that all compared models were trained under identical schedules, augmentations, optimizers, and hyperparameters; the abstract provides no statement that baselines were re-trained under the authors' protocol, leaving the attribution unestablished.

    Authors: We agree the abstract should explicitly address the comparison protocol. In the manuscript, comparisons to prior Transformer-based methods use their originally published results; the nnU-Net is the default implementation from the nnU-Net framework; and ResEnc-L/MedNeXt results are from published benchmarks. Our models were trained under the protocol matching the default nnU-Net. We will revise the abstract to state this clearly and add a sentence noting that full re-training of all external baselines under identical conditions was not performed owing to computational cost, while the independent analysis of attention limitations (Section 3) stands on its own.

    revision: yes

  2. Referee: [Abstract] Abstract / Experiments: no details are supplied on the number of runs, statistical testing (e.g., paired t-tests or Wilcoxon tests with correction), or variance across the nine datasets; without these, the reported ranking (PrimusV2 on par with ResEnc-L/MedNeXt) cannot be assessed for robustness.

    Authors: We acknowledge the absence of these details. The manuscript reports results from single training runs per model per dataset, which is standard in this domain due to the high cost of 3D training. We will revise the abstract and add a dedicated paragraph in the experiments section stating the number of runs (one per configuration), confirming that formal statistical tests were not applied, and noting that consistent performance across nine heterogeneous datasets provides supporting evidence of robustness. Additional multi-seed variance experiments could be included if requested but would require substantial new compute.

    revision: partial
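
One inexpensive way to back a ranking across datasets, even with a single run per configuration, is a paired test on the per-dataset scores. The referee names paired t-tests and Wilcoxon tests; the sketch below swaps in an exact paired sign-flip permutation test because it needs only the standard library. The function name and every score are hypothetical and are not results from the paper.

```python
from itertools import product


def paired_sign_flip_p_value(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired per-dataset differences."""
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired per dataset")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Under the null hypothesis each difference is equally likely to have either sign.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total


# Hypothetical per-dataset Dice scores for two models on nine datasets.
model_a = [0.86, 0.79, 0.91, 0.74, 0.88, 0.83, 0.69, 0.90, 0.77]
model_b = [0.85, 0.80, 0.90, 0.73, 0.88, 0.82, 0.70, 0.89, 0.76]
print(paired_sign_flip_p_value(model_a, model_b))
```

With nine datasets there are only 512 sign patterns, so the smallest attainable two-sided p-value is 2/512; a claim of parity or superiority across nine datasets has limited statistical resolution whichever test is chosen.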

Circularity Check

0 steps flagged

No circularity: empirical architecture comparison with no derivation chain

Full rationale

The paper introduces Primus and PrimusV2 as Transformer-centric 3D segmentation models and supports its claims solely through empirical benchmarking on nine public datasets, showing competitive or superior performance versus prior hybrids and CNNs. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the provided text. The central claim reduces to experimental results rather than any self-referential reduction of outputs to inputs by construction, making the derivation chain self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract describes an empirical architecture paper with no explicit mathematical axioms, free parameters, or invented physical entities.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Uni-Encoder Meets Multi-Encoders: Representation Before Fusion for Brain Tumor Segmentation with Missing Modalities

    cs.CV 2026-04 unverdicted novelty 5.0

    UniME combines a pretrained unified ViT encoder with modality-specific CNN encoders to improve brain tumor segmentation performance when some MRI modalities are missing.

Reference graph

Works this paper leans on

110 extracted references · 110 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    Transformers in time-series analysis: A tutorial

    Sabeen Ahmed, Ian E Nielsen, Aakash Tripathi, Shamoon Siddiqui, Ravi P Ramachandran, and Ghulam Rasool. Transformers in time-series analysis: A tutorial. Circuits, Systems, and Signal Processing, pages 1–34, 2023. 1

  2. [2]

    Transformers in remote sensing: A survey

    Abdulaziz Amer Aleissaee, Amandeep Kumar, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal, Gui-Song Xia, and Fahad Shahbaz Khan. Transformers in remote sensing: A survey. Remote Sensing, 15(7):1860, 2023. 1

  3. [3]

    Object detection using deep learning, cnns and vision transformers: A review

    Ayoub Benali Amjoud and Mustapha Amrouch. Object detection using deep learning, cnns and vision transformers: A review. IEEE Access, 2023. 16

  4. [4]

    Self-supervised learning from images with a joint-embedding predictive architecture

    Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023. 2

  5. [5]

    Touchstone benchmark: Are we on the right way for evaluating ai algorithms for medical segmentation?

    Pedro RAS Bassi, Wenxuan Li, Yucheng Tang, Fabian Isensee, Zifu Wang, Jieneng Chen, Yu-Cheng Chou, Yannick Kirchhoff, Maximilian Rokuss, Ziyan Huang, et al. Touchstone benchmark: Are we on the right way for evaluating ai algorithms for medical segmentation? arXiv preprint arXiv:2411.03670, 2024. 1, 4

  6. [6]

    Deep learning techniques for automatic mri cardiac multi-structures segmentation and diagnosis: is the problem solved?

    Olivier Bernard, Alain Lalande, Clement Zotti, Cervenansky, et al. Deep learning techniques for automatic mri cardiac multi-structures segmentation and diagnosis: is the problem solved? IEEE TMI, 2018. 5

  7. [7]

    The liver tumor segmentation benchmark (lits)

    Patrick Bilic, Patrick Christ, Hongwei Bran Li, Eugene Vorontsov, Avi Ben-Cohen, Georgios Kaissis, Adi Szeskin, Colin Jacobs, Gabriel Efrain Humpire Mamani, Gabriel Chartrand, et al. The liver tumor segmentation benchmark (lits). Medical Image Analysis, 84:102680, 2023. 5

  8. [8]

    Swin-unet: Unet-like pure transformer for medical image segmentation

    Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang. Swin-unet: Unet-like pure transformer for medical image segmentation. In European conference on computer vision, pages 205–218. Springer, 2022. 1, 2, 15, 17, 18

  9. [9]

    A survey on evaluation of large language models

    Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109, 2023. 1

  10. [10]

    Transattunet: Multi-level attention-guided u-net with transformer for medical image segmentation

    Bingzhi Chen, Yishu Liu, Zheng Zhang, Guangming Lu, and Adams Wai Kin Kong. Transattunet: Multi-level attention-guided u-net with transformer for medical image segmentation. IEEE Transactions on Emerging Topics in Computational Intelligence, 2023. 17

  11. [11]

    TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation

    Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L. Yuille, and Yuyin Zhou. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv preprint arXiv:2102.04306,

  12. [12]

    Transunet: Rethinking the u-net architecture design for medical image segmentation through the lens of transformers

    Jieneng Chen, Jieru Mei, Xianhang Li, Yongyi Lu, Qihang Yu, Qingyue Wei, Xiangde Luo, Yutong Xie, Ehsan Adeli, Yan Wang, et al. Transunet: Rethinking the u-net architecture design for medical image segmentation through the lens of transformers. Medical Image Analysis, 97:103280, 2024. 1

  13. [13]

    Mask2former for video instance segmentation

    Bowen Cheng, Anwesa Choudhuri, Ishan Misra, Alexander Kirillov, Rohit Girdhar, and Alexander G Schwing. Mask2former for video instance segmentation. arXiv preprint arXiv:2112.10764, 2021. 1

  14. [14]

    Per-pixel classification is not all you need for semantic segmentation

    Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems, 34:17864–17875, 2021. 1

  15. [15]

    Vision Transformers Need Registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. arXiv preprint arXiv:2309.16588, 2023. 7

  16. [16]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR. O...

  17. [17]

    Eva-02: A visual representation for neon genesis

    Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva-02: A visual representation for neon genesis. Image and Vision Computing, 149:105171,

  18. [18]

    Utnet: a hybrid transformer architecture for medical image segmentation

    Yunhe Gao, Mu Zhou, and Dimitris N Metaxas. Utnet: a hybrid transformer architecture for medical image segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part III 24, pages 61–71. Springer, 2021. 2, 14, 15, 17, 18

  19. [19]

    Abo-Elhoda, Sara W

    Lidia Garrucho, Claire-Anne Reidel, Kaisar Kushibar, Smriti Joshi, Richard Osuala, Apostolia Tsirikoglou, Maciej Bobowicz, Javier del Riego, Alessandro Catanese, Katarzyna Gwoździewicz, Maria-Laura Cosaka, Pasant M. Abo-Elhoda, Sara W. Tantawy, Shorouq S. Sakrana, Norhan O. Shawky-Abdelfatah, Amr Muhammad Abdo-Salem, Androniki Kozana, Eugen Divjak,...

  20. [20]

    Deep learning enables automatic detection and segmentation of brain metastases on multisequence mri

    Endre Grøvik, Darvin Yi, Michael Iv, Elizabeth Tong, Daniel Rubin, and Greg Zaharchuk. Deep learning enables automatic detection and segmentation of brain metastases on multisequence mri. Journal of Magnetic Resonance Imaging, 51(1):175–182, 2020. 7

  21. [21]

    Developing generalist foundation models from a multimodal dataset for 3d computed tomography

    Ibrahim Ethem Hamamci, Sezgin Er, Furkan Almas, Ayse Gulnihan Simsek, Sevval Nil Esirgun, Irem Dogan, Muhammed Furkan Dasdelen, Omer Faruk Durugol, Bastian Wittmann, Tamaz Amiranashvili, et al. Developing generalist foundation models from a multimodal dataset for 3d computed tomography. 2024. 2

  22. [22]

    Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images

    Ali Hatamizadeh, Vishwesh Nath, Yucheng Tang, Dong Yang, Holger R Roth, and Daguang Xu. Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images. In International MICCAI Brainlesion Workshop, pages 272–284. Springer, 2021. 1, 2, 7, 8, 15, 16, 17, 18

  23. [23]

    Unetr: Transformers for 3d medical image segmentation

    Ali Hatamizadeh, Yucheng Tang, Vishwesh Nath, Dong Yang, Andriy Myronenko, Bennett Landman, Holger R Roth, and Daguang Xu. Unetr: Transformers for 3d medical image segmentation. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 574–584, 2022. 1, 2, 4, 5, 7, 8, 15, 16, 17, 18

  24. [24]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022. 2

  25. [25]

    Transformers in medical image analysis

    Kelei He, Chen Gan, Zhuoyuan Li, Islem Rekik, Zihao Yin, Wen Ji, Yang Gao, Qian Wang, Junfeng Zhang, and Dinggang Shen. Transformers in medical image analysis. Intelligent Medicine, 3(1):59–78, 2023. 17

  26. [26]

    Swinunetr-v2: Stronger swin transformers with stagewise convolutions for 3d medical image segmentation

    Yufan He, Vishwesh Nath, Dong Yang, Yucheng Tang, Andriy Myronenko, and Daguang Xu. Swinunetr-v2: Stronger swin transformers with stagewise convolutions for 3d medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 416–426. Springer, 2023. 1

  27. [27]

    The KiTS19 challenge data: 300 kidney tumor cases with clinical context, CT semantic segmentations, and surgical outcomes

    Nicholas Heller, Niranjan Sathianathen, Arveen Kalapara, Edward Walczak, Keenan Moore, Heather Kaluzniak, Joel Rosenberg, Paul Blake, Zachary Rengel, Makinna Oestreich, et al. The kits19 challenge data: 300 kidney tumor cases with clinical context, ct semantic segmentations, and surgical outcomes. arXiv preprint arXiv:1904.00445, 2019. 2, 15

  28. [28]

    The kits21 challenge: Automatic segmentation of kidneys, renal tumors, and renal cysts in corticomedullary-phase ct

    Nicholas Heller, Fabian Isensee, Dasha Trofimova, Resha Tejpaul, et al. The kits21 challenge: Automatic segmentation of kidneys, renal tumors, and renal cysts in corticomedullary-phase ct, 2023. 5

  29. [29]

    Missformer: An effective medical image segmentation transformer

    Xiaohong Huang, Zhifang Deng, Dandan Li, and Xueguang Yuan. Missformer: An effective medical image segmentation transformer. arXiv preprint arXiv:2109.07162, 2021. 17

  30. [30]

    Stu-net: Scalable and transferable medical image segmentation models empowered by large-scale supervised pre-training

    Ziyan Huang, Haoyu Wang, Zhongying Deng, Jin Ye, Yanzhou Su, Hui Sun, Junjun He, Yun Gu, Lixu Gu, Shaoting Zhang, et al. Stu-net: Scalable and transferable medical image segmentation models empowered by large-scale supervised pre-training. arXiv preprint arXiv:2304.06716,

  31. [31]

    Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison

    Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence, pages 590–597, 2019. 2

  32. [32]

    nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation

    Fabian Isensee, Paul F. Jaeger, Simon A.A. Kohl, Jens Petersen, and Klaus H. Maier-Hein. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18(2):203–211, 2021. 1, 2, 3, 7, 8, 14, 15, 17, 19

  33. [33]

    nnu-net revisited: A call for rigorous validation in 3d medical image segmentation

    Fabian Isensee, Tassilo Wald, Constantin Ulrich, Michael Baumgartner, Saikat Roy, Klaus Maier-Hein, and Paul F Jaeger. nnu-net revisited: A call for rigorous validation in 3d medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 488–498. Springer, 2024. 1, 2, 4, 5, 7, 8, 15, 16

  34. [34]

    Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation

    Yuanfeng Ji, Haotian Bai, Chongjian Ge, Jie Yang, Ye Zhu, Ruimao Zhang, Zhen Li, Lingyan Zhang, Wanling Ma, Xiang Wan, et al. Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation. Advances in Neural Information Processing Systems, 35:36722–36732, 2022. 5, 15

  35. [35]

    Bitr-unet: a cnn-transformer combined network for mri brain tumor segmentation

    Qiran Jia and Hai Shu. Bitr-unet: a cnn-transformer combined network for mri brain tumor segmentation. In International MICCAI Brainlesion Workshop, pages 3–14. Springer,

  36. [36]

    Swinbts: A method for 3d multimodal brain tumor segmentation using swin transformer

    Yun Jiang, Yuan Zhang, Xin Lin, Jinkun Dong, Tongtong Cheng, and Jing Liang. Swinbts: A method for 3d multimodal brain tumor segmentation using swin transformer. Brain sciences, 12(6):797, 2022. 17

  37. [37]

    Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports

    Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data, 6(1):317, 2019. 2

  38. [38]

    Transformers in medical image segmentation: a narrative review

    Rabeea Fatma Khan, Byoung-Dai Lee, and Mu Sook Lee. Transformers in medical image segmentation: a narrative review. Quantitative Imaging in Medicine and Surgery, 13(12):8747, 2023. 2

  39. [39]

    Transformers in vision: A survey

    Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. ACM computing surveys (CSUR), 54(10s):1–41, 2022. 1, 16

  40. [40]

    Similarity of neural network representations revisited

    Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In 36th International Conference on Machine Learning, ICML 2019, pages 6156–6175, 2019. 25

  41. [41]

    Miccai multi-atlas labeling beyond the cranial vault–workshop and challenge

    Bennett Landman, Zhoubing Xu, J Iglesias, Martin Styner, T Langerak, and Arno Klein. Miccai multi-atlas labeling beyond the cranial vault–workshop and challenge. In Proc. MICCAI Multi-Atlas Labeling Beyond Cranial Vault–Workshop Challenge, page 12, 2015. 15

  42. [42]

    A systematic collection of medical image datasets for deep learning

    Johann Li, Guangming Zhu, Cong Hua, Mingtao Feng, Ping Li, Xiaoyuan Lu, Juan Song, Peiyi Shen, Xu Xu, Lin Mei, et al. A systematic collection of medical image datasets for deep learning. arXiv preprint arXiv:2106.12864, 2021. 4, 17, 24

  43. [43]

    Abdomenatlas: A large-scale, detailed-annotated, & multi-center dataset for efficient transfer learning and open algorithmic benchmarking

    Wenxuan Li, Chongyu Qu, Xiaoxi Chen, Pedro RAS Bassi, Yijia Shi, Yuxiang Lai, Qian Yu, Huimin Xue, Yixiong Chen, Xiaorui Lin, et al. Abdomenatlas: A large-scale, detailed-annotated, & multi-center dataset for efficient transfer learning and open algorithmic benchmarking. Medical Image Analysis, 97:103285, 2024. 4, 24

  44. [44]

    Transformer for object detection: Review and benchmark

    Yong Li, Naipeng Miao, Liangdi Ma, Feng Shuang, and Xingwen Huang. Transformer for object detection: Review and benchmark. Engineering Applications of Artificial Intelligence, 126:107021, 2023. 16

  45. [45]

    A large, curated, open-source stroke neuroimaging dataset to improve lesion segmentation algorithms

    Sook-Lei Liew, Bethany P Lo, Miranda R Donnelly, Artemis Zavaliangos-Petropulu, Jessica N Jeong, Giuseppe Barisano, Alexandre Hutton, Julia P Simon, Julia M Juliano, Anisha Suri, et al. A large, curated, open-source stroke neuroimaging dataset to improve lesion segmentation algorithms. Scientific data, 9(1):320, 2022. 7

  46. [46]

    Ds-transunet: Dual swin transformer u-net for medical image segmentation

    Ailiang Lin, Bingzhi Chen, Jiayu Xu, Zheng Zhang, Guangming Lu, and David Zhang. Ds-transunet: Dual swin transformer u-net for medical image segmentation. IEEE Transactions on Instrumentation and Measurement, 71:1–15, 2022. 17

  47. [47]

    A survey on deep learning in medical image analysis

    Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen Awm Van Der Laak, Bram Van Ginneken, and Clara I Sánchez. A survey on deep learning in medical image analysis. Medical image analysis, 42:60–88,

  48. [48]

    Efficient training of visual transformers with small datasets

    Yahui Liu, Enver Sangineto, Wei Bi, Nicu Sebe, Bruno Lepri, and Marco Nadai. Efficient training of visual transformers with small datasets. Advances in Neural Information Processing Systems, 34:23818–23830, 2021. 4, 23

  49. [49]

    A survey of visual transformers

    Yang Liu, Yao Zhang, Yixin Wang, Feng Hou, Jin Yuan, Jiang Tian, Yang Zhang, Zhongchao Shi, Jianping Fan, and Zhiqiang He. A survey of visual transformers. IEEE Transactions on Neural Networks and Learning Systems, 2023. 16

  50. [50]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021. 17

  51. [51]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 14

  52. [52]

    Word: A large scale dataset, benchmark and clinical applicable study for abdominal organ segmentation from ct image

    Xiangde Luo, Wenjun Liao, Jianghong Xiao, Jieneng Chen, Tao Song, Xiaofan Zhang, Kang Li, Dimitris N Metaxas, Guotai Wang, and Shaoting Zhang. Word: A large scale dataset, benchmark and clinical applicable study for abdominal organ segmentation from ct image. Medical Image Analysis, 82:102642, 2022. 7

  53. [53]

    Automatic organ and pan-cancer segmentation in abdomen ct: the flare 2023 challenge

    Jun Ma, Yao Zhang, Song Gu, Cheng Ge, Ershuai Wang, Qin Zhou, Ziyan Huang, Pengju Lyu, Jian He, and Bo Wang. Automatic organ and pan-cancer segmentation in abdomen ct: the flare 2023 challenge. arXiv preprint arXiv:2408.12534,

  54. [54]

    Do wide and deep networks learn the same things? uncovering how neural network representations vary with width and depth

    Thao Nguyen, Maithra Raghu, and Simon Kornblith. Do wide and deep networks learn the same things? uncovering how neural network representations vary with width and depth. arXiv preprint arXiv:2010.15327, 2020. 25

  55. [55]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 2

  56. [56]

    A robust volumetric transformer for accurate 3d tumor segmentation

    Himashi Peiris, Munawar Hayat, Zhaolin Chen, Gary Egan, and Mehrtash Harandi. A robust volumetric transformer for accurate 3d tumor segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 162–172. Springer, 2022. 17

  57. [57]

    U-net transformer: Self and cross attention for medical image segmentation

    Olivier Petit, Nicolas Thome, Clement Rambour, Loic Themyr, Toby Collins, and Luc Soler. U-net transformer: Self and cross attention for medical image segmentation. In Machine Learning in Medical Imaging: 12th International Workshop, MLMI 2021, Held in Conjunction with MICCAI 2021, Strasbourg, France, September 27, 2021, Proceedings 12, pages 267–276. S...

  58. [58]

    Automated detection and quantification of brain metastases on clinical mri data using artificial neural networks

    Irada Pflüger, Tassilo Wald, Fabian Isensee, Marianne Schell, Hagen Meredig, Kai Schlamp, Denise Bernhardt, Gianluca Brugnara, Claus Peter Heußel, Juergen Debus, et al. Automated detection and quantification of brain metastases on clinical mri data using artificial neural networks. Neuro-oncology advances, 4(1):vdac138, 2022. 8

  59. [59]

    Abdomenatlas-8k: Annotating 8,000 ct volumes for multi-organ segmentation in three weeks

    Chongyu Qu, Tiezheng Zhang, Hualin Qiao, Yucheng Tang, Alan L Yuille, Zongwei Zhou, et al. Abdomenatlas-8k: Annotating 8,000 ct volumes for multi-organ segmentation in three weeks. Advances in Neural Information Processing Systems, 36, 2024. 4, 24

  60. [60]

    Mednext: transformer-driven scaling of convnets for medical image segmentation

    Saikat Roy, Gregor Koehler, Constantin Ulrich, Michael Baumgartner, Jens Petersen, Fabian Isensee, Paul F Jaeger, and Klaus H Maier-Hein. Mednext: transformer-driven scaling of convnets for medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 405–415. Springer,

  61. [61]

    ImageNet Large Scale Visual Recognition Challenge

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. 23

  62. [62]

    Transformers in medical imaging: A survey

    Fahad Shamshad, Salman Khan, Syed Waqas Zamir, Muhammad Haris Khan, Munawar Hayat, Fahad Shahbaz Khan, and Huazhu Fu. Transformers in medical imaging: A survey. Medical Image Analysis, page 102802, 2023. 17

  63. [63]

    The curious case of absolute position embeddings

    Koustuv Sinha, Amirhossein Kazemnejad, Siva Reddy, Joelle Pineau, Dieuwke Hupkes, and Adina Williams. The curious case of absolute position embeddings. arXiv preprint arXiv:2210.12574, 2022. 6

  64. [64]

    Feature selection via dependence maximization

    Le Song, Alex Smola, Arthur Gretton, Justin Bedo, and Karsten Borgwardt. Feature selection via dependence maximization. Journal of Machine Learning Research, 13:1393–1434, 2012. 25

  65. [65]

    From generalist to specialist: Incorporating domain-knowledge into flamingo for chest x-ray report generation

    Raphael Stock, Stefan Denner, Yannick Kirchhoff, Constantin Ulrich, Maximilian Rouven Rokuss, Saikat Roy, Nico Disch, and Klaus Maier-Hein. From generalist to specialist: Incorporating domain-knowledge into flamingo for chest x-ray report generation. In Medical Imaging with Deep Learning, 2024. 2

  66. [66]

    Segmenter: Transformer for semantic segmentation

    Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7262–7272, 2021. 16

  67. [67]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063,

  68. [68]

    Revisiting unreasonable effectiveness of data in deep learning era

    Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 843–852, 2017. 23

  69. [69]

    Self-supervised pre-training of swin transformers for 3d medical image analysis

    Yucheng Tang, Dong Yang, Wenqi Li, Holger R Roth, Bennett Landman, Daguang Xu, Vishwesh Nath, and Ali Hatamizadeh. Self-supervised pre-training of swin transformers for 3d medical image analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20730–20740, 2022. 17

  70. [70]

    Training data-efficient image transformers & distillation through attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021. 1, 4

  71. [71]

    Going deeper with image transformers

    Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 32–42, 2021. 6

  72. [72]

    Deit iii: Revenge of the vit

    Hugo Touvron, Matthieu Cord, and Hervé Jégou. Deit iii: Revenge of the vit. In European conference on computer vision, pages 516–533. Springer, 2022. 4

  73. [73]

    Multitalent: A multi-dataset approach to medical image segmentation

    Constantin Ulrich, Fabian Isensee, Tassilo Wald, Maximilian Zenk, Michael Baumgartner, and Klaus H Maier-Hein. Multitalent: A multi-dataset approach to medical image segmentation. arXiv preprint arXiv:2303.14444, 2023. 2, 24

  74. [74]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. IEEE Industry Applications Magazine, 8(1):8–15, 2017. 1

  75. [75]

    Transbts: Multimodal brain tumor segmentation using transformer

    Wenxuan Wang, Chen Chen, Meng Ding, Hong Yu, Sen Zha, and Jiangyun Li. Transbts: Multimodal brain tumor segmentation using transformer. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24, pages 109–119. Springer, 2021. 1, 2...

  76. [76]

    Medclip: Contrastive learning from unpaired medical images and text

    Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, and Jimeng Sun. Medclip: Contrastive learning from unpaired medical images and text. arXiv preprint arXiv:2210.10163, 2022. 2

  77. [77]

    Totalsegmentator: robust segmentation of 104 anatomical structures in ct images

    Jakob Wasserthal, M. Meyer, Hanns-Christian Breit, Joshy Cyriac, Shan Yang, and Martin Segeroth. Totalsegmentator: robust segmentation of 104 anatomical structures in ct images. ArXiv, abs/2208.05868, 2022. 2, 4, 15, 24

  78. [78]

    High-resolution swin transformer for automatic medical image segmentation

    Chen Wei, Shenghan Ren, Kaitai Guo, Haihong Hu, and Jimin Liang. High-resolution swin transformer for automatic medical image segmentation. Sensors, 23(7):3420, 2023. 5

  79. [79]

    D-former: A u-shaped dilated transformer for 3d medical image segmentation

    Yixuan Wu, Kuanlun Liao, Jintai Chen, Jinhong Wang, Danny Z Chen, Honghao Gao, and Jian Wu. D-former: A u-shaped dilated transformer for 3d medical image segmentation. Neural Computing and Applications, 35(2):1931–1944,

  80. [80]

    Transformers in medical image segmentation: A review

    Hanguang Xiao, Li Li, Qiyuan Liu, Xiuhong Zhu, and Qihang Zhang. Transformers in medical image segmentation: A review. Biomedical Signal Processing and Control, 84:104791, 2023. 17

Showing first 80 references.