S2M-Net: Spectral-Spatial Mixing for Medical Image Segmentation with Morphology-Aware Adaptive Loss

Md. Sanaullah Chowdhury, Lameya Sabrin

arxiv: 2601.01285 · v2 · submitted 2026-01-03 · 💻 cs.CV

Pith reviewed 2026-05-16 17:26 UTC · model grok-4.3

classification 💻 cs.CV

keywords medical image segmentation · spectral token mixer · FFT-based mixing · morphology-aware loss · lightweight segmentation · global context · adaptive loss function · Dice score

The pith

S2M-Net reports state-of-the-art medical image segmentation across 16 datasets using a 4.7M-parameter model that mixes spectral and spatial features at linearithmic, O(HW log HW), cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

S2M-Net addresses the tension between local boundary precision, global anatomical context, and computational limits in medical segmentation. It replaces quadratic self-attention with a spectral token mixer that applies truncated 2D FFT, learnable frequency filters, and gated spatial projection to capture long-range dependencies efficiently. A separate morphology-aware loss automatically tunes five complementary terms according to each structure's compactness, shape, and scale, removing the need for manual per-dataset adjustments. The resulting network reports leading Dice scores on polyp, instrument, and tumor tasks while using several times fewer parameters than transformer baselines. If the approach holds, it indicates that medical images' natural frequency properties can supply global context without the data hunger or compute cost of attention mechanisms.

Core claim

The paper claims that medical images are spectrally concentrated enough for a truncated 2D FFT with learnable frequency filtering and content-gated projection to deliver global receptive fields at O(HW log HW) cost. Paired with a morphology-aware adaptive loss that analyzes compactness, tubularity, irregularity, and scale to modulate loss weights via constrained learnable parameters, the 4.7M-parameter S2M-Net is reported to outperform both convolutional and transformer baselines on 16 datasets spanning 8 modalities.

What carries the argument

The Spectral-Selective Token Mixer (SSTM), which exploits spectral concentration through truncated 2D FFT, learnable frequency filtering, and content-gated spatial projection to supply global context, together with the Morphology-Aware Adaptive Segmentation Loss (MASL) that inspects structure properties to modulate five loss components.
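The three ingredients named above (truncated 2D FFT, a learnable frequency filter, and a content-gated projection) can be illustrated with a toy NumPy sketch. This is not the paper's implementation: the function name `sstm_sketch`, the centred square truncation window, the sigmoid gate, and the residual connection are all assumptions made for illustration, and the "learnable" arrays are passed in as plain parameters rather than trained.

```python
import numpy as np

def sstm_sketch(x, freq_filter, gate_w, keep):
    """Toy spectral token mixer for one feature map x of shape (H, W, C).

    freq_filter: complex array (keep, keep, C); stand-in for a learnable filter.
    gate_w:      real array (C, C); stand-in for a content-gated projection.
    keep:        odd number of low-frequency rows/columns retained (hypothetical choice).
    """
    H, W, C = x.shape
    spec = np.fft.fft2(x, axes=(0, 1))             # every output mixes all H*W positions
    spec = np.fft.fftshift(spec, axes=(0, 1))      # move the DC term to (H//2, W//2)

    # Truncation: keep a centred keep x keep low-frequency window, zero the rest.
    r0, c0 = H // 2 - keep // 2, W // 2 - keep // 2
    trunc = np.zeros_like(spec)
    trunc[r0:r0 + keep, c0:c0 + keep] = spec[r0:r0 + keep, c0:c0 + keep] * freq_filter

    trunc = np.fft.ifftshift(trunc, axes=(0, 1))
    # Taking the real part is a simplification; an arbitrary complex filter
    # does not guarantee a real-valued inverse transform.
    mixed = np.fft.ifft2(trunc, axes=(0, 1)).real

    gate = 1.0 / (1.0 + np.exp(-(x @ gate_w)))     # sigmoid gate computed from the input
    return x + gate * mixed                        # gated residual update
```

With an all-ones filter the block reduces to a gated low-pass of the input, which makes the stress-test concern concrete: anything outside the retained window can only re-enter through the residual path or through other layers.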

If this is right

Global context becomes available at linearithmic rather than quadratic cost, enabling larger input resolutions on clinical hardware.
Loss tuning no longer requires separate hyperparameter searches for each new modality or organ.
Performance gains of 3-18% Dice appear consistently across polyp, instrument, and tumor segmentation tasks.
Model size stays under 5 million parameters, supporting deployment on edge devices with limited memory.
The same architecture works across eight imaging modalities without architectural changes.
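To give a feel for the linearithmic-versus-quadratic gap behind the first point, the following sketch compares the two growth rates for a few square token grids. It deliberately ignores constant factors, channel dimensions, and the FFT truncation, so the ratios are orders of magnitude, not measured speedups; `mixing_cost` is a hypothetical helper name.

```python
import math

def mixing_cost(h, w):
    """Return (n^2, n*log2(n)) for an h x w token grid, up to constant factors."""
    n = h * w
    return n * n, n * math.log2(n)

for side in (64, 128, 256):
    quad, loglin = mixing_cost(side, side)
    print(f"{side}x{side}: n^2 = {quad:.2e}, n log2 n = {loglin:.2e}, "
          f"ratio = {quad / loglin:.0f}")
```

The ratio grows as n / log2(n), which is why the advantage widens with input resolution.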

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The spectral-mixing pattern may transfer to other domains with concentrated frequency content, such as remote-sensing or microscopic imaging.
The automatic morphology analysis could shorten training pipelines by removing the need for extensive loss-function engineering.
If the FFT truncation proves robust, the method opens a route to real-time segmentation in operating rooms without cloud offloading.
Combining the adaptive loss with other frequency-domain operators might further improve multi-scale boundary detection.

Load-bearing premise

Medical images contain enough concentrated spectral energy that truncating the 2D FFT preserves critical boundary and anatomical information, and that automatic analysis of morphological features can produce loss weights that generalize without overfitting or dataset-specific bias.
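The second half of this premise, mapping morphology to loss weights, can be sketched as follows. The source does not give the paper's descriptor formulas, the names of the five loss terms, or the form of the constraint, so everything here is an assumption: compactness is taken to be the standard isoperimetric quotient, and the constraint is modelled as a softmax with a per-term floor. `compactness` and `constrained_weights` are hypothetical names.

```python
import numpy as np

def compactness(mask):
    """Isoperimetric quotient 4*pi*A / P^2 of a binary mask (assumed descriptor)."""
    mask = mask.astype(bool)
    area = mask.sum()
    padded = np.pad(mask, 1)                      # zero border so edges are counted
    # Perimeter estimate: foreground pixels whose 4-neighbour is background.
    perim = sum(np.logical_and(padded, ~np.roll(padded, s, axis=a)).sum()
                for a in (0, 1) for s in (1, -1))
    return 4.0 * np.pi * area / max(perim, 1) ** 2

def constrained_weights(logits, descriptors, proj, floor=0.05):
    """Map learnable logits plus morphology descriptors to positive weights summing to 1.

    logits:      (K,) learnable offsets, one per loss term.
    descriptors: (D,) morphology features, e.g. compactness, tubularity, ...
    proj:        (K, D) learnable projection from descriptors to logits.
    floor:       minimum weight per term (hypothetical constraint).
    """
    z = logits + proj @ descriptors
    z = z - z.max()                               # numerical stability
    soft = np.exp(z) / np.exp(z).sum()
    k = len(soft)
    return floor + (1.0 - k * floor) * soft       # each weight >= floor, total == 1
```

Because `proj` and `logits` would be fitted on the same data that supplies the descriptors, a sketch like this also shows where dataset-specific bias could enter.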

What would settle it

Evaluation on a medical dataset whose Fourier spectrum is deliberately diffuse rather than concentrated, checking whether SSTM accuracy falls below attention-based models while keeping parameter count fixed.
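Whether a dataset's spectrum is concentrated or diffuse can be checked before any training. The sketch below measures the fraction of spectral energy inside a centred low-frequency window; the window shape and the helper name `low_freq_energy_fraction` are assumptions, and the paper's own truncation rule may differ.

```python
import numpy as np

def low_freq_energy_fraction(img, keep):
    """Fraction of spectral energy inside a centred keep x keep low-frequency window.

    img:  2D real array (one grayscale image or one channel).
    keep: side length of the retained window (hypothetical truncation size).
    """
    spec = np.fft.fftshift(np.fft.fft2(img))      # DC term moved to the centre
    power = np.abs(spec) ** 2
    h, w = img.shape
    r0, c0 = h // 2 - keep // 2, w // 2 - keep // 2
    return power[r0:r0 + keep, c0:c0 + keep].sum() / power.sum()
```

A perfectly flat image scores 1.0, while white noise scores roughly `keep**2 / (h * w)`. A dataset whose images sit near the noise end of that range would be a natural candidate for the falsification test described above.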

Figures

Figures reproduced from arXiv: 2601.01285 by Md. Sanaullah Chowdhury and Lameya Sabrin.

**Figure 2.** S2M-Net overview. The encoder has five stages with MRF-SE blocks and stage-wise global mixing via SSTM, while the decoder uses a boundary-focused pathway with soft spatial routing to refine region and boundary predictions.

Abstract

Medical image segmentation requires balancing local precision for boundary-critical clinical applications, global context for anatomical coherence, and computational efficiency for deployment on limited data and hardware, a trilemma that existing architectures fail to resolve. Convolutional networks provide local precision at $\mathcal{O}(n)$ cost but have limited receptive fields, while vision transformers achieve global context through $\mathcal{O}(n^2)$ self-attention at prohibitive computational expense, causing overfitting on small clinical datasets. We propose S2M-Net, a 4.7M-parameter architecture that achieves $\mathcal{O}(HW \log HW)$ global context through two synergistic innovations: (i) Spectral-Selective Token Mixer (SSTM), which exploits the spectral concentration of medical images via truncated 2D FFT with learnable frequency filtering and content-gated spatial projection, avoiding quadratic attention cost while maintaining global receptive fields; and (ii) Morphology-Aware Adaptive Segmentation Loss (MASL), which automatically analyzes structure characteristics (compactness, tubularity, irregularity, scale) to modulate five complementary loss components through constrained learnable weights, eliminating manual per-dataset tuning. Comprehensive evaluation on 16 medical imaging datasets that span 8 modalities demonstrates state-of-the-art performance: 96.12\% Dice on polyp segmentation, 83.77\% on surgical instruments (+17.85\% over the prior art) and 80.90\% on brain tumors, with consistent 3-18\% improvements over specialized baselines while using 3.5--6$\times$ fewer parameters than transformer-based methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

S2M-Net swaps quadratic attention for truncated FFT mixing plus an adaptive morphology loss and reports solid Dice gains on 16 datasets with 4.7M params, but the abstract leaves the high-frequency boundary question and lack of ablations unresolved.

The core contribution is a lightweight segmentation network that replaces self-attention with a Spectral-Selective Token Mixer: truncated 2D FFT, learnable frequency filters, and a content-gated spatial projection. Paired with that is the Morphology-Aware Adaptive Segmentation Loss, which estimates compactness, tubularity, irregularity, and scale to set constrained weights on five loss terms. Both pieces are new in this combination for medical images, and the efficiency story (O(HW log HW) global context, 3.5-6x fewer parameters than transformers) is the main practical hook. The reported numbers (96.12% Dice on polyps, 83.77% on instruments, 80.90% on brain tumors) look competitive across eight modalities if they hold up. The paper does a reasonable job framing the trilemma of local precision, global context, and compute on small clinical data.

The stress-test concern about spectral truncation discarding boundary-critical high frequencies is worth taking seriously; medical images do concentrate energy in low frequencies, but polyps and instrument tips live in the details, and the abstract does not show whether the learnable filter and gating recover enough of them. The absence of ablations, error bars, or baseline implementation details in the summary also makes it hard to attribute the gains cleanly to the new modules rather than dataset-specific tuning.

This is the kind of work that would interest readers building deployable clinical tools who already know the standard U-Net and transformer baselines. If the full paper supplies the missing ablations, statistical tests, and code, it deserves a serious referee; otherwise the efficiency claim stays provisional.

Referee Report

4 major / 2 minor

Summary. The manuscript proposes S2M-Net, a 4.7M-parameter model for medical image segmentation that uses a Spectral-Selective Token Mixer (SSTM) employing truncated 2D FFT with learnable frequency filtering and content-gated projection to achieve O(HW log HW) global context, combined with a Morphology-Aware Adaptive Segmentation Loss (MASL) that modulates loss components via constrained learnable weights based on morphological analysis. It reports state-of-the-art Dice scores on 16 datasets from 8 modalities, including 96.12% on polyps and 83.77% on surgical instruments, with 3-18% improvements over baselines and 3.5-6x fewer parameters than transformers.

Significance. If the claimed gains are robustly validated, S2M-Net could offer a practical solution to the efficiency-accuracy trade-off in medical segmentation, enabling deployment on resource-limited hardware while handling diverse modalities without manual loss tuning. The use of FFT for global mixing is a strength if the spectral assumptions hold.

major comments (4)

§3.1 (SSTM): The truncation of 2D FFT coefficients is central to the efficiency claim, but the manuscript does not provide ablation experiments comparing truncated vs. full-spectrum versions or quantifying information loss in high-frequency bands critical for boundaries (e.g., polyp edges). This is load-bearing because the skeptic concern directly questions whether the reported Dice gains can be attributed to the global context mechanism.
Results (Tables 1-3): No ablation studies, error bars, or statistical significance tests (e.g., paired t-tests) are reported for the performance improvements. Baseline implementations lack details on hyperparameters and training protocols, undermining the ability to confirm that gains stem from the proposed innovations rather than experimental setup differences.
§4.2 (MASL): The constrained learnable weights in MASL are optimized jointly with the network on the same training data used for morphology analysis, creating a risk of circularity where the adaptation fits dataset-specific traits. Without cross-validation or hold-out tests for the loss weights, the generalization claim across 16 datasets is not fully supported.
Methods: The paper lacks experiments isolating the contribution of the content-gated spatial projection in SSTM and the individual components of the five-loss MASL, making it hard to assess if both innovations are necessary for the claimed performance.

minor comments (2)

Abstract: The abstract mentions 'consistent 3-18% improvements' but does not specify the exact baselines for each percentage; a table reference would improve clarity.
Figure 3: The morphology analysis visualization could include quantitative metrics for compactness and tubularity to aid interpretation.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

Referee: [§3.1 (SSTM)] The truncation of 2D FFT coefficients is central to the efficiency claim, but the manuscript does not provide ablation experiments comparing truncated vs. full-spectrum versions or quantifying information loss in high-frequency bands critical for boundaries (e.g., polyp edges). This is load-bearing because the skeptic concern directly questions whether the reported Dice gains can be attributed to the global context mechanism.

Authors: We agree that direct ablations on truncation are required to substantiate the efficiency claims. In the revised manuscript we will add experiments comparing the truncated SSTM against a full-spectrum variant on representative datasets, together with quantitative measures of high-frequency information loss (boundary F1 and edge-preservation metrics) to demonstrate that the reported gains can be attributed to the proposed global-context mechanism. revision: yes
Referee: [Results (Tables 1-3)] No ablation studies, error bars, or statistical significance tests (e.g., paired t-tests) are reported for the performance improvements. Baseline implementations lack details on hyperparameters and training protocols, undermining the ability to confirm that gains stem from the proposed innovations rather than experimental setup differences.

Authors: We acknowledge the absence of statistical validation and reproducibility details. We will expand the results section with error bars from multiple random seeds, paired t-tests for significance, and complete hyperparameter and training-protocol specifications for all baselines to allow independent verification that the gains arise from the proposed innovations. revision: yes
Referee: [§4.2 (MASL)] The constrained learnable weights in MASL are optimized jointly with the network on the same training data used for morphology analysis, creating a risk of circularity where the adaptation fits dataset-specific traits. Without cross-validation or hold-out tests for the loss weights, the generalization claim across 16 datasets is not fully supported.

Authors: The risk of circularity in the joint optimization of MASL weights is a legitimate concern. We will add experiments that optimize the loss weights via k-fold cross-validation on training folds and evaluate the resulting weights on held-out test sets across multiple datasets, thereby providing stronger evidence for generalization of the morphology-aware adaptation. revision: yes
Referee: [Methods] The paper lacks experiments isolating the contribution of the content-gated spatial projection in SSTM and the individual components of the five-loss MASL, making it hard to assess if both innovations are necessary for the claimed performance.

Authors: We agree that component-wise ablations are necessary. In the revision we will include targeted ablation studies that isolate the content-gated spatial projection within SSTM and that evaluate each of the five loss terms in MASL individually, thereby clarifying whether both innovations are required for the observed performance. revision: yes
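The paired t-tests promised in the second response amount to a short computation on matched scores. The sketch below is a standard-library version that returns the t statistic and degrees of freedom for two methods evaluated on the same seeds or cases; it assumes matched, roughly normal differences, and the helper name `paired_t` is hypothetical. Converting the statistic to a p-value still requires a t-distribution table or a statistics library.

```python
import math
import statistics

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for matched scores a vs b.

    a, b: equal-length sequences of per-seed (or per-case) Dice scores
          for two methods evaluated under identical conditions.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)                 # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean / (sd / math.sqrt(n)), n - 1
```

Reporting the mean difference and its standard error alongside the statistic would also give the error bars the referee asks for.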

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper presents S2M-Net as an empirical architecture combining a Spectral-Selective Token Mixer (truncated 2D FFT with learnable filtering and gated projection) and a Morphology-Aware Adaptive Segmentation Loss (structure analysis modulating five loss terms via constrained learnable weights). No equations, definitions, or claims reduce any reported performance metric or architectural property to its own inputs by construction. The O(HW log HW) complexity follows directly from the FFT operation, which is a standard property independent of the paper's results. Learnable weights in MASL are optimized during training but are not presented as 'predictions' of held-out quantities; they are design parameters whose effect is measured on separate test sets across 16 datasets. No self-citations are invoked to justify uniqueness or forbid alternatives, and no ansatz is smuggled via prior work. The derivation chain consists of standard signal-processing primitives plus an adaptive loss formulation, with all quantitative claims being externally falsifiable empirical outcomes rather than tautologies.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The central claims rest on two new components whose behavior depends on domain assumptions about medical-image spectra and on learnable parameters that are fitted during training. No external benchmarks or formal proofs are provided for the frequency truncation strategy or the morphology-to-weight mapping.

free parameters (2)

learnable frequency filters: The truncated 2D FFT stage uses learnable filters to select frequencies; these are optimized on training data.
constrained learnable loss weights: MASL modulates five loss components via weights that are learned subject to constraints but still fitted to the segmentation task.

axioms (2)

domain assumption: Medical images exhibit spectral concentration that allows truncated FFT to capture global context without critical information loss. Invoked to justify the SSTM design in the abstract.
domain assumption: Structure morphology (compactness, tubularity, irregularity, scale) can be automatically extracted and used to modulate loss terms in a generalizable way. Core premise of the MASL component.

invented entities (2)

Spectral-Selective Token Mixer (SSTM) · no independent evidence
purpose: Provide O(HW log HW) global context via FFT-based mixing instead of quadratic attention. New architectural block introduced by the paper.
Morphology-Aware Adaptive Segmentation Loss (MASL) · no independent evidence
purpose: Automatically balance multiple loss terms according to detected structure properties. New loss formulation introduced by the paper.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 4 internal anchors

[1]

U-Net: Convo- lutional networks for biomedical image segmentation,

O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convo- lutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Interven- tion (MICCAI), pp. 234–241, Springer, 2015

work page 2015
[2]

nnu-net: A self-configuring method for deep learning-based biomedical image segmentation,

F. Isensee, P. F. Jaeger, S. A. Kohl, J. Petersen, and K. H. Maier-Hein, “nnu-net: A self-configuring method for deep learning-based biomedical image segmentation,”Na- ture Methods, vol. 18, no. 2, pp. 203–211, 2021

work page 2021
[3]

Unet++: A nested u-net architecture for medical image segmentation,

Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: A nested u-net architecture for medical image segmentation,” inDeep Learning in Medical Image Analy- sis and Multimodal Learning for Clinical Decision Support, pp. 3–11, Springer, 2018

work page 2018
[4]

Ce-net: Context encoder network for 2d medical image segmentation,

Y . Chen, W. Yu, T. Zhang,et al., “Ce-net: Context encoder network for 2d medical image segmentation,”IEEE Trans- actions on Medical Imaging, vol. 38, no. 10, pp. 2281–2292, 2019

work page 2019
[5]

Attention U-Net: Learning Where to Look for the Pancreas

O. Oktay, J. Schlemper, L. L. Folgoc,et al., “Attention u- net: Learning where to look for the pancreas,”arXiv preprint arXiv:1804.03999, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[6]

Multi-scale context aggregation by di- lated convolutions,

F. Yu and V . Koltun, “Multi-scale context aggregation by di- lated convolutions,” inInternational Conference on Learning Representations (ICLR), 2016

work page 2016
[7]

Encoder-decoder with atrous separable convolution for se- mantic image segmentation,

L.-C. Chen, Y . Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for se- mantic image segmentation,” inEuropean Conference on Computer Vision (ECCV), pp. 801–818, 2018

work page 2018
[8]

Boundary loss for highly unbalanced segmentation,

H. Kervadec, J. Bouchtiba, C. Desrosiers,et al., “Boundary loss for highly unbalanced segmentation,” inMedical Imag- ing with Deep Learning (MIDL), 2019

work page 2019
[9]

Topology- preserving deep image segmentation,

X. Hu, L. Li, D. Samaras, and C. Chen, “Topology- preserving deep image segmentation,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 32, 2019

work page 2019
[10]

Transunet: Transformers make strong encoders for medical image segmentation,

J. Chen, Y . Lu, Q. Yu, X. Luo, E. Adeli, Y . Wang, L. Lu, A. L. Yuille, and Y . Zhou, “Transunet: Transformers make strong encoders for medical image segmentation,” 2021

work page 2021
[11]

Unetr: Transformers for 3d medical image segmentation,

A. Hatamizadeh, D. Yang, H. Roth, and D. Xu, “Unetr: Transformers for 3d medical image segmentation,” inWin- ter Conference on Applications of Computer Vision (WACV), pp. 574–584, 2022

work page 2022
[12]

Fnet: Mixing tokens with fourier transforms,

J. Lee-Thorp, J. Ainslie, I. Eckstein, and S. Onta ˜n´on, “Fnet: Mixing tokens with fourier transforms,” inProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 4294–4305, 2022

work page 2022
[13]

Global filter networks for image classification,

Y . Rao, W. Zhao, B. Tang,et al., “Global filter networks for image classification,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 34, pp. 980–993, 2021

work page 2021
[14]

Fourier neural operator for parametric partial differential equations,

Z. Li, N. Kovachki, K. Azizzadenesheli,et al., “Fourier neural operator for parametric partial differential equations,” inInternational Conference on Learning Representations (ICLR), 2021

work page 2021
[15]

Efficiently modeling long se- quences with structured state spaces,

A. Gu, K. Goel, and C. R ´e, “Efficiently modeling long se- quences with structured state spaces,” inInternational Con- ference on Learning Representations (ICLR), 2022

work page 2022
[16]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,”arXiv preprint arXiv:2312.00752, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

Self-supervised pre-training of swin transformers for 3d medical image analysis,

Y . Tang, D. Yang, W. Li,et al., “Self-supervised pre-training of swin transformers for 3d medical image analysis,” inPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 20730–20740, 2022

work page 2022
[18]

U-mamba: Enhancing long- range dependency for biomedical image segmentation,

J. Ma, F. Li, and B. Wang, “U-mamba: Enhancing long- range dependency for biomedical image segmentation,” 2024

work page 2024
[19]

Segment anything,

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo, P. Doll´ar, and R. Girshick, “Segment anything,” 2023

work page 2023
[20]

Seg- ment anything in medical images,

J. Ma, Y . He, F. Li, L. Han, C. You, and B. Wang, “Seg- ment anything in medical images,”Nature Communications, vol. 15, Jan. 2024

work page 2024
[21]

Faster segment anything: Towards lightweight sam for mobile applications,

C. Zhang, D. Han, Y . Qiao, J. U. Kim, S.-H. Bae, S. Lee, and C. S. Hong, “Faster segment anything: Towards lightweight sam for mobile applications,” 2023

work page 2023
[22]

C. H. Sudre, W. Li, T. Vercauteren, S. Ourselin, and M. Jorge Cardoso,Generalised Dice Overlap as a Deep Learning Loss Function for Highly Unbalanced Segmenta- tions, p. 240–248. Springer International Publishing, 2017

work page 2017
[23]

Focal loss for dense object detection,

T.-Y . Lin, P. Goyal, R. Girshick, K. He, and P. Doll´ar, “Focal loss for dense object detection,” 2018

work page 2018
[24]

Multi-task learning us- ing uncertainty to weigh losses for scene geometry and se- mantics,

A. Kendall, Y . Gal, and R. Cipolla, “Multi-task learning us- ing uncertainty to weigh losses for scene geometry and se- mantics,” 2018

work page 2018
[25]

Mobilenets: Effi- cient convolutional neural networks for mobile vision appli- cations,

A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Effi- cient convolutional neural networks for mobile vision appli- cations,” 2017

work page 2017
[26]

Squeeze-and-excitation net- works,

J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation net- works,” inProceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141, 2018

work page 2018
[27]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar,et al., “Attention is all you need,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 30, 2017

work page 2017
[28]

Global filter networks for image classification,

Y . Rao, W. Zhao, Z. Zhu, J. Lu, and J. Zhou, “Global filter networks for image classification,” 2021

work page 2021
[29]

Kvasir-seg: A segmented polyp dataset,

D. Jha, P. H. Smedsrud, M. A. Riegler, P. Halvorsen, T. de Lange, D. Johansen, and H. D. Johansen, “Kvasir-seg: A segmented polyp dataset,” inInternational Conference on Multimedia Modeling (MMM), pp. 451–462, Springer, 2020

work page 2020
[30]

Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians,

J. Bernal, F. J. S ´anchez, G. Fern ´andez-Esparrach, D. Gil, C. Rodr´ıguez, and F. Vilari˜no, “Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians,”Computerized Medical Imaging and Graphics, vol. 43, pp. 99–111, 2015

work page 2015
[31]

An efficient polyp segmentation network,

T. Erol and D. Sarikaya, “An efficient polyp segmentation network,” 2022

work page 2022
[32]

Toward embedded detection of polyps in wce images for early diagnosis of colorectal cancer,

J. Silva, A. Histace, O. Romain, X. Dray, and B. Granado, “Toward embedded detection of polyps in wce images for early diagnosis of colorectal cancer,”International Journal of Computer Assisted Radiology and Surgery, vol. 9, no. 2, pp. 283–293, 2014

work page 2014
[33]

Pranet: Parallel reverse attention network for polyp segmentation,

D.-P. Fan, G.-P. Ji, T. Zhou, G. Chen, H. Fu, J. Shen, and L. Shao, “Pranet: Parallel reverse attention network for polyp segmentation,” inInternational Conference on Medical Im- age Computing and Computer-Assisted Intervention (MIC- CAI), pp. 263–273, Springer, 2020

work page 2020
[34]

Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC)

N. Codella, V . Rotemberg, P. Tschandl, M. E. Celebi, S. Dusza, D. Gutman, B. Helba, A. Kalloo, K. Liopy- ris, M. Marchetti,et al., “Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the inter- national skin imaging collaboration (isic),” inarXiv preprint arXiv:1902.03368, 2019

work page internal anchor Pith review Pith/arXiv arXiv 2018
[35]

Ph2-a dermoscopic image database for re- search and benchmarking,

T. Mendonc ¸a, P. M. Ferreira, J. S. Marques, A. R. Marcal, and J. Rozeira, “Ph2-a dermoscopic image database for re- search and benchmarking,” inAnnual International Confer- ence of the IEEE Engineering in Medicine and Biology So- ciety (EMBC), pp. 5437–5440, IEEE, 2013

work page 2013
[36]

Dataset of breast ultrasound images,

W. Al-Dhabyani, M. Gomaa, H. Khaled, and A. Fahmy, “Dataset of breast ultrasound images,”Data in Brief, vol. 28, p. 104863, 2020

work page 2020
[37]

Gland segmentation in colon histology images: The glas challenge contest,

K. Sirinukunwattana, J. P. Pluim, H. Chen, X. Qi, P.-A. Heng, Y . B. Guo, L. Y . Wang, B. J. Matuszewski, E. Bruni, U. Sanchez,et al., “Gland segmentation in colon histology images: The glas challenge contest,”Medical Image Analy- sis, vol. 35, pp. 489–502, 2017

work page 2017
[38]

The multimodal brain tumor image seg- mentation benchmark (brats),

B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, Y . Burren, N. Porz, J. Slotboom, R. Wiest,et al., “The multimodal brain tumor image seg- mentation benchmark (brats),”IEEE Transactions on Medi- cal Imaging, vol. 34, no. 10, pp. 1993–2024, 2014

work page 1993
[39]

The liver tumor segmentation benchmark (lits),

P. Bilic, P. F. Christ, E. V orontsov, G. Chlebus, H. Chen, Q. Dou, C.-W. Fu, X. Han, P.-A. Heng, J. Hesser,et al., “The liver tumor segmentation benchmark (lits),”Medical Image Analysis, vol. 84, p. 102680, 2023

work page 2023
[40]

2017 Robotic Instrument Segmentation Challenge

M. Allan, A. Shvets, T. Kurmann, Z. Zhang, R. Duggal, Y .- H. Su, N. Rieke, I. Laina, N. Kalavakonda, S. Bodenstedt, et al., “2017 robotic instrument segmentation challenge,” arXiv preprint arXiv:1902.06426, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2017
[41]

Ridge-based vessel segmentation in color images of the retina,

J. Staal, M. D. Abr `amoff, M. Niemeijer, M. A. Viergever, and B. Van Ginneken, “Ridge-based vessel segmentation in color images of the retina,”IEEE Transactions on Medical Imaging, vol. 23, no. 4, pp. 501–509, 2004

work page 2004
[42]

An ensemble classification-based approach applied to retinal blood vessel segmentation,

M. M. Fraz, P. Remagnino, A. Hoppe, B. Uyyanonvara, A. R. Rudnicka, C. G. Owen, and S. A. Barman, “An ensemble classification-based approach applied to retinal blood vessel segmentation,”IEEE Transactions on Biomedical Engineer- ing, vol. 59, no. 9, pp. 2538–2548, 2012

work page 2012
[43]

Locating blood vessels in retinal images by piecewise threshold prob- ing of a matched filter response,

A. Hoover, V . Kouznetsova, and M. Goldbaum, “Locating blood vessels in retinal images by piecewise threshold prob- ing of a matched filter response,”IEEE Transactions on Med- ical Imaging, vol. 19, no. 3, pp. 203–210, 2000

work page 2000

[1] [1]

U-Net: Convo- lutional networks for biomedical image segmentation,

O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convo- lutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Interven- tion (MICCAI), pp. 234–241, Springer, 2015

work page 2015

[2] [2]

nnu-net: A self-configuring method for deep learning-based biomedical image segmentation,

F. Isensee, P. F. Jaeger, S. A. Kohl, J. Petersen, and K. H. Maier-Hein, “nnu-net: A self-configuring method for deep learning-based biomedical image segmentation,”Na- ture Methods, vol. 18, no. 2, pp. 203–211, 2021

work page 2021

[3] [3]

Unet++: A nested u-net architecture for medical image segmentation,

Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: A nested u-net architecture for medical image segmentation,” inDeep Learning in Medical Image Analy- sis and Multimodal Learning for Clinical Decision Support, pp. 3–11, Springer, 2018

work page 2018

[4] [4]

Ce-net: Context encoder network for 2d medical image segmentation,

Y . Chen, W. Yu, T. Zhang,et al., “Ce-net: Context encoder network for 2d medical image segmentation,”IEEE Trans- actions on Medical Imaging, vol. 38, no. 10, pp. 2281–2292, 2019

work page 2019

[5] [5]

Attention U-Net: Learning Where to Look for the Pancreas

O. Oktay, J. Schlemper, L. L. Folgoc,et al., “Attention u- net: Learning where to look for the pancreas,”arXiv preprint arXiv:1804.03999, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[6] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” in International Conference on Learning Representations (ICLR), 2016.

[7] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in European Conference on Computer Vision (ECCV), pp. 801–818, 2018.

[8] H. Kervadec, J. Bouchtiba, C. Desrosiers, et al., “Boundary loss for highly unbalanced segmentation,” in Medical Imaging with Deep Learning (MIDL), 2019.

[9] X. Hu, L. Li, D. Samaras, and C. Chen, “Topology-preserving deep image segmentation,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 32, 2019.

[10] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. L. Yuille, and Y. Zhou, “Transunet: Transformers make strong encoders for medical image segmentation,” 2021.

[11] A. Hatamizadeh, D. Yang, H. Roth, and D. Xu, “Unetr: Transformers for 3d medical image segmentation,” in Winter Conference on Applications of Computer Vision (WACV), pp. 574–584, 2022.

[12] J. Lee-Thorp, J. Ainslie, I. Eckstein, and S. Ontañón, “Fnet: Mixing tokens with fourier transforms,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 4294–4305, 2022.

[13] Y. Rao, W. Zhao, B. Tang, et al., “Global filter networks for image classification,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 34, pp. 980–993, 2021.

[14] Z. Li, N. Kovachki, K. Azizzadenesheli, et al., “Fourier neural operator for parametric partial differential equations,” in International Conference on Learning Representations (ICLR), 2021.

[15] A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with structured state spaces,” in International Conference on Learning Representations (ICLR), 2022.

[16] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2023.

[17] Y. Tang, D. Yang, W. Li, et al., “Self-supervised pre-training of swin transformers for 3d medical image analysis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 20730–20740, 2022.

[18] J. Ma, F. Li, and B. Wang, “U-mamba: Enhancing long-range dependency for biomedical image segmentation,” 2024.

[19] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick, “Segment anything,” 2023.

[20] J. Ma, Y. He, F. Li, L. Han, C. You, and B. Wang, “Segment anything in medical images,” Nature Communications, vol. 15, Jan. 2024.

[21] C. Zhang, D. Han, Y. Qiao, J. U. Kim, S.-H. Bae, S. Lee, and C. S. Hong, “Faster segment anything: Towards lightweight sam for mobile applications,” 2023.

[22] C. H. Sudre, W. Li, T. Vercauteren, S. Ourselin, and M. Jorge Cardoso, “Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations,” pp. 240–248, Springer International Publishing, 2017.

[23] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” 2018.

[24] A. Kendall, Y. Gal, and R. Cipolla, “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics,” 2018.

[25] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” 2017.

[26] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141, 2018.

[27] A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is all you need,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 30, 2017.

[28] Y. Rao, W. Zhao, Z. Zhu, J. Lu, and J. Zhou, “Global filter networks for image classification,” 2021.

[29] D. Jha, P. H. Smedsrud, M. A. Riegler, P. Halvorsen, T. de Lange, D. Johansen, and H. D. Johansen, “Kvasir-seg: A segmented polyp dataset,” in International Conference on Multimedia Modeling (MMM), pp. 451–462, Springer, 2020.

[30] J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, D. Gil, C. Rodríguez, and F. Vilariño, “Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians,” Computerized Medical Imaging and Graphics, vol. 43, pp. 99–111, 2015.

[31] T. Erol and D. Sarikaya, “An efficient polyp segmentation network,” 2022.

[32] J. Silva, A. Histace, O. Romain, X. Dray, and B. Granado, “Toward embedded detection of polyps in wce images for early diagnosis of colorectal cancer,” International Journal of Computer Assisted Radiology and Surgery, vol. 9, no. 2, pp. 283–293, 2014.

[33] D.-P. Fan, G.-P. Ji, T. Zhou, G. Chen, H. Fu, J. Shen, and L. Shao, “Pranet: Parallel reverse attention network for polyp segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 263–273, Springer, 2020.

[34] N. Codella, V. Rotemberg, P. Tschandl, M. E. Celebi, S. Dusza, D. Gutman, B. Helba, A. Kalloo, K. Liopyris, M. Marchetti, et al., “Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic),” arXiv preprint arXiv:1902.03368, 2019.

[35] T. Mendonça, P. M. Ferreira, J. S. Marques, A. R. Marcal, and J. Rozeira, “Ph2-a dermoscopic image database for research and benchmarking,” in Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 5437–5440, IEEE, 2013.

[36] W. Al-Dhabyani, M. Gomaa, H. Khaled, and A. Fahmy, “Dataset of breast ultrasound images,” Data in Brief, vol. 28, p. 104863, 2020.

[37] K. Sirinukunwattana, J. P. Pluim, H. Chen, X. Qi, P.-A. Heng, Y. B. Guo, L. Y. Wang, B. J. Matuszewski, E. Bruni, U. Sanchez, et al., “Gland segmentation in colon histology images: The glas challenge contest,” Medical Image Analysis, vol. 35, pp. 489–502, 2017.

[38] B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest, et al., “The multimodal brain tumor image segmentation benchmark (brats),” IEEE Transactions on Medical Imaging, vol. 34, no. 10, pp. 1993–2024, 2014.

[39] P. Bilic, P. F. Christ, E. Vorontsov, G. Chlebus, H. Chen, Q. Dou, C.-W. Fu, X. Han, P.-A. Heng, J. Hesser, et al., “The liver tumor segmentation benchmark (lits),” Medical Image Analysis, vol. 84, p. 102680, 2023.

[40] M. Allan, A. Shvets, T. Kurmann, Z. Zhang, R. Duggal, Y.-H. Su, N. Rieke, I. Laina, N. Kalavakonda, S. Bodenstedt, et al., “2017 robotic instrument segmentation challenge,” arXiv preprint arXiv:1902.06426, 2020.

[41] J. Staal, M. D. Abràmoff, M. Niemeijer, M. A. Viergever, and B. Van Ginneken, “Ridge-based vessel segmentation in color images of the retina,” IEEE Transactions on Medical Imaging, vol. 23, no. 4, pp. 501–509, 2004.

[42] M. M. Fraz, P. Remagnino, A. Hoppe, B. Uyyanonvara, A. R. Rudnicka, C. G. Owen, and S. A. Barman, “An ensemble classification-based approach applied to retinal blood vessel segmentation,” IEEE Transactions on Biomedical Engineering, vol. 59, no. 9, pp. 2538–2548, 2012.

[43] A. Hoover, V. Kouznetsova, and M. Goldbaum, “Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response,” IEEE Transactions on Medical Imaging, vol. 19, no. 3, pp. 203–210, 2000.