arxiv: 2605.11131 · v1 · submitted 2026-05-11 · 💻 cs.CV

Recognition: no theorem link

USEMA: a Scalable Efficient Mamba Like Attention for Medical Image Segmentation

Elisha Dayag , Nhat Thanh Tran , Jack Xin

Authors on Pith no claims yet

Pith reviewed 2026-05-13 07:08 UTC · model grok-4.3

classification 💻 cs.CV

keywords medical image segmentationefficient attentionMamba-like attentionhybrid UNetlocal window attentionarithmetic averagingvision transformerscomputational efficiency

0 comments

The pith

USEMA integrates local window attention and arithmetic averaging into a UNet to deliver more accurate medical image segmentation at lower computational cost than full self-attention transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces USEMA, a hybrid UNet architecture that pairs convolutional layers for local detail with a new attention module called SEMA. SEMA avoids the quadratic cost and focus dispersion of standard self-attention by restricting most interactions to local windows and using simple arithmetic averaging to pull in global information. Across experiments on multiple imaging modalities and resolutions, USEMA shows higher segmentation accuracy than pure CNNs or Mamba models and runs faster than transformer baselines. A reader should care because medical segmentation routinely needs both fine local boundaries and broad context, yet existing efficient alternatives often sacrifice one for the other.

Core claim

The central claim is that token localization through local window attention combined with theoretically consistent arithmetic averaging produces a scalable form of global attention that, when embedded in a CNN-UNet backbone, yields both higher Dice scores and lower FLOPs than either pure convolutional networks, Mamba-based models, or vision transformers that rely on full self-attention.

What carries the argument

SEMA (Scalable and Efficient Mamba-like Attention), which restricts token interactions to local windows to preserve focus and supplements them with arithmetic averaging to recover global context.

If this is right

USEMA can process larger 2D slices or volumes without the memory explosion typical of full attention.
The same hybrid block can be dropped into other encoder-decoder segmentation networks to trade quadratic cost for linear scaling.
Segmentation accuracy improves on both high-resolution and low-contrast modalities without modality-specific redesign.
Inference speed gains make the model more practical for clinical workflows that require near-real-time output.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The arithmetic-averaging step offers a lightweight alternative to more complex state-space or selective-scan mechanisms in other efficient-attention designs.
Because the method keeps most computation local, it may extend naturally to 3D volumetric segmentation where global attention is even more expensive.
The separation of local focus from global averaging could be tested as a plug-in module for non-medical dense-prediction tasks such as semantic segmentation in autonomous driving.
If the averaging proves sufficient, future work could explore replacing it with learned but still linear global pooling to close any remaining gap with full attention.

Load-bearing premise

The assumption that local-window attention plus arithmetic averaging reliably gathers enough long-range context without dispersion or loss of focus, and that the resulting hybrid produces consistent gains over baselines without per-dataset retuning.

What would settle it

On a held-out medical dataset or at higher image resolutions, if USEMA's Dice score falls below a well-tuned transformer baseline or its runtime advantage disappears without extra hyperparameter search, the claim of reliable superiority would be refuted.

Figures

Figures reproduced from arXiv: 2605.11131 by Elisha Dayag, Jack Xin, Nhat Thanh Tran.

**Figure 1.** Figure 1: Left: The attention matrix of UNETR on a sequence of length 5376 Right: The attention matrix that would be obtained via a uniform attention i.e. setting each entry to 1/5376. Note that the values in the left-hand matrix are all within 1e − 4 of the mean and bounded above by 2e − 4 and below by 1e − 4. A way to mitigate the computational cost is to limit the context window of the selection process, that is … view at source ↗

**Figure 2.** Figure 2: USEMA architecture (left) and its SEMA block (right). 2.2 USEMA: Merging SEMA with UNet To utilize SEMA for medical image segmentation, we merge SEMA into the UNet framework to create USEMA, depicted in [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

read the original abstract

Accurate medical image segmentation is an integral part of the medical image analysis pipeline that requires the ability to merge local and global information. While vision transformers are able to capture global interactions using vanilla self-attention, their quadratic computational complexity in the input size remains a struggle for medical image segmentation tasks. Motivated by the dispersion property of vanilla self-attention and recent development of Mamba form of attention, Scalable and Efficient Mamba like Attention (SEMA) utilizes token localization via local window attention to avoid dispersion and maintain focusing, complemented by theoretically consistent arithmetic averaging to capture global aspect of attention. In this work, we present USEMA, a hybrid UNet architecture that merges the local feature extraction ability of convolutional neural networks (CNNs) with SEMA attention. We conduct experiments with USEMA across a variety of modalities and image sizes, demonstrating improved computational efficiency compared to transformer based models using full self-attention, and superior segmentation performance relative to purely convolution and Mamba-based models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces USEMA, a hybrid UNet architecture for medical image segmentation that combines CNN-based local feature extraction with a novel Scalable and Efficient Mamba-like Attention (SEMA) module. SEMA employs local window attention for token localization to prevent dispersion and maintain focus, paired with arithmetic averaging to incorporate global context. The authors claim this yields improved computational efficiency over full self-attention transformers and superior segmentation performance compared to pure CNN and Mamba-based models across multiple modalities and image sizes.

Significance. If the empirical claims and the SEMA mechanism are validated with full ablations and derivations, the work could advance efficient attention alternatives for high-resolution medical imaging, where quadratic transformer costs are prohibitive and pure Mamba or CNN models struggle with global dependencies.

major comments (2)

[§3] §3 (SEMA formulation): The description of arithmetic averaging after local-window attention lacks explicit equations showing how distant token interactions are integrated without dispersion or focus loss. If averaging operates only on per-window outputs, long-range dependencies remain unmodeled, directly challenging the central claim that SEMA reliably captures global context; a derivation or counter-example analysis is required.
[Experiments] Experiments section (performance tables): The abstract asserts consistent gains over CNN and Mamba baselines across modalities and sizes, yet no ablation isolates the arithmetic-averaging component versus local windows alone. Without such controls or statistical significance tests, the superiority cannot be attributed to the hybrid design and may be dataset-specific.

minor comments (2)

[Abstract] Abstract: The phrase 'theoretically consistent arithmetic averaging' is introduced without a one-sentence definition or reference to the supporting derivation; adding this would improve immediate clarity.
[Method] Notation: Local-window size and averaging scope are not defined with symbols in the high-level description; consistent variable names (e.g., W for window, A for averaging operator) would aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Revisions have been made to the manuscript to incorporate additional mathematical detail and empirical controls.

read point-by-point responses

Referee: [§3] §3 (SEMA formulation): The description of arithmetic averaging after local-window attention lacks explicit equations showing how distant token interactions are integrated without dispersion or focus loss. If averaging operates only on per-window outputs, long-range dependencies remain unmodeled, directly challenging the central claim that SEMA reliably captures global context; a derivation or counter-example analysis is required.

Authors: We agree that the original §3 would benefit from greater mathematical precision. In the revised manuscript we have expanded the SEMA formulation with explicit equations: local-window attention is first applied independently within each window to localize tokens and avoid dispersion; the resulting per-window outputs are then aggregated via arithmetic averaging across all windows. We derive that this averaging step computes a global mean that propagates information from distant tokens into each local representation while preserving the focusing property of the windowed attention. A short proof sketch and a counter-example (showing that local windows alone fail to link distant regions) have been added to demonstrate that long-range dependencies are modeled without quadratic cost. revision: yes
Referee: [Experiments] Experiments section (performance tables): The abstract asserts consistent gains over CNN and Mamba baselines across modalities and sizes, yet no ablation isolates the arithmetic-averaging component versus local windows alone. Without such controls or statistical significance tests, the superiority cannot be attributed to the hybrid design and may be dataset-specific.

Authors: We acknowledge the value of isolating the averaging component. The revised Experiments section now includes a dedicated ablation table comparing the full SEMA (local windows + arithmetic averaging) against a local-window-only variant across all modalities and image sizes. The results show consistent additional gains from the averaging step. We have also added paired t-tests on Dice scores, confirming statistical significance (p < 0.05) of the observed improvements. These controls indicate that the reported superiority is attributable to the complete hybrid design rather than dataset idiosyncrasies. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical validation of proposed architecture

full rationale

The paper presents USEMA as a hybrid UNet merging CNN local extraction with SEMA (local-window attention plus arithmetic averaging for global context). The abstract and described method contain no derivation chain, equations, or fitted parameters that reduce a 'prediction' to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, no ansatz is smuggled, and no renaming of known results occurs. Performance claims (efficiency vs. transformers, accuracy vs. CNN/Mamba baselines) are externally verifiable via experiments across modalities and sizes, making the work self-contained against benchmarks rather than tautological. This matches the default expectation that most papers are non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that local-window-plus-averaging attention preserves global information equivalently to full self-attention.

pith-pipeline@v0.9.0 · 5470 in / 1137 out tokens · 33443 ms · 2026-05-13T07:08:55.752706+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 3 internal anchors

[1]

2017 robotic instrument segmentation challenge.arXiv preprint arXiv:1902.06426, 2019

Allan, M., Shvets, A., Kurmann, T., Zhang, Z., Duggal, R., Su, Y.H., Rieke, N., Laina, I., Kalavakonda, N., Bodenstedt, S., et al.: 2017 robotic instrument segmen- tation challenge. arXiv preprint arXiv:1902.06426 (2019)

work page arXiv 2017
[2]

Layer Normalization

Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016
[3]

arXiv preprint arXiv:2102.10882 , year=

Chu, X., Tian, Z., Zhang, B., Wang, X., Shen, C.: Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882 (2021)

work page arXiv 2021
[4]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Dong, X., Bao, J., Chen, D., Zhang, W., Yu, N., Yuan, L., Chen, D., Guo, B.: Cswin transformer: A general vision transformer backbone with cross-shaped win- dows. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12124–12134 (2022)

work page 2022
[5]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Dosovitskiy, A.: An image is worth 16x16 words: Transformers for image recogni- tion at scale. arXiv preprint arXiv:2010.11929 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2010
[6]

In: First Conference on Language Modeling (2024), https://openreview.net/forum?id=tEYskw1VY2

Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selec- tive state spaces. In: First Conference on Language Modeling (2024), https://openreview.net/forum?id=tEYskw1VY2

work page 2024
[7]

In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H

Gu, A., Dao, T., Ermon, S., Rudra, A., Ré, C.: Hippo: Recurrent memory with op- timal polynomial projections. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 1474–1487. Curran Associates, Inc. (2020)

work page 2020
[8]

Advances in neural information processing systems 37, 127181–127203 (2024)

Han, D., Wang, Z., Xia, Z., Han, Y., Pu, Y., Ge, C., Song, J., Song, S., Zheng, B., Huang, G.: Demystify mamba in vision: A linear attention perspective. Advances in neural information processing systems 37, 127181–127203 (2024)

work page 2024
[9]

In: International MICCAI brainlesion workshop

Hatamizadeh, A., Nath, V., Tang, Y., Yang, D., Roth, H.R., Xu, D.: Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images. In: International MICCAI brainlesion workshop. pp. 272–284. Springer (2021)

work page 2021
[10]

In: Proceedings of the IEEE/CVF winter conference on applications of computer vi- sion

Hatamizadeh, A., Tang, Y., Nath, V., Yang, D., Myronenko, A., Landman, B., Roth, H.R., Xu, D.: Unetr: Transformers for 3d medical image segmentation. In: Proceedings of the IEEE/CVF winter conference on applications of computer vi- sion. pp. 574–584 (2022)

work page 2022
[11]

Nature methods 18(2), 203–211 (2021)

Isensee, F., Jaeger, P.F., Kohl, S.A., Petersen, J., Maier-Hein, K.H.: nnu-net: a self-conﬁguring method for deep learning-based biomedical image segmentation. Nature methods 18(2), 203–211 (2021)

work page 2021
[12]

arXiv preprint arXiv:2410.23738 (2024)

Jiang, Y., Li, Z., Chen, X., Xie, H., Cai, J.: Mlla-unet: Mamba-like linear atten- tion in an eﬃcient u-shape model for medical image segmentation. arXiv preprint arXiv:2410.23738 (2024)

work page arXiv 2024
[13]

In: International conference on machine learning

Katharopoulos, A., Vyas, A., Pappas, N., Fleuret, F.: Transformers are rnns: Fast autoregressive transformers with linear attention. In: International conference on machine learning. pp. 5156–5165. PMLR (2020) 10 E. Dayag et al

work page 2020
[14]

In: Lebanon, G., Vishwanathan, S.V.N

Lee, C.Y., Xie, S., Gallagher, P., Zhang, Z., Tu, Z.: Deeply-Supervised Nets. In: Lebanon, G., Vishwanathan, S.V.N. (eds.) Proceedings of the Eighteenth Interna- tional Conference on Artiﬁcial Intelligence and Statistics. Proceedings of Machine Learning Research, vol. 38, pp. 562–570. PMLR, San Diego, California, USA (09–12 May 2015), https://proceedings....

work page 2015
[15]

Transmamba: Flexibly switching between transformer and mamba

Li, Y., Xie, R., Yang, Z., Sun, X., Li, S., Han, W., Kang, Z., Cheng, Y., Xu, C., Wang, D., et al.: Transmamba: Flexibly switching between transformer and mamba. arXiv preprint arXiv:2503.24067 (2025)

work page arXiv 2025
[16]

In: International conference on medical image computing and computer- assisted intervention

Liu, J., Yang, H., Zhou, H.Y., Xi, Y., Yu, L., Li, C., Liang, Y., Shi, G., Yu, Y., Zhang, S., et al.: Swin-umamba: Mamba-based unet with imagenet-based pre- training. In: International conference on medical image computing and computer- assisted intervention. pp. 615–625. Springer (2024)

work page 2024
[17]

U-mamba: Enhancing long-range dependency for biomedical image segmentation

Ma, J., Li, F., Wang, B.: U-mamba: Enhancing long-range dependency for biomed- ical image segmentation. arXiv preprint arXiv:2401.04722 (2024)

work page arXiv 2024
[18]

Nature methods 21(6), 1103–1113 (2024)

Ma, J., Xie, R., Ayyadhury, S., Ge, C., Gupta, A., Gupta, R., Gu, S., Zhang, Y., Lee, G., Kim, J., et al.: The multimodality cell segmentation challenge: toward universal solutions. Nature methods 21(6), 1103–1113 (2024)

work page 2024
[19]

The Lancet Digital Health 6(11), e815–e826 (2024)

Ma, J., Zhang, Y., Gu, S., Ge, C., Mae, S., Young, A., Zhu, C., Yang, X., Meng, K., Huang, Z., et al.: Unleashing the strengths of unlabelled data in deep learning- assisted pan-cancer abdominal organ quantiﬁcation: the ﬂare22 challenge. The Lancet Digital Health 6(11), e815–e826 (2024)

work page 2024
[20]

In: Proc

Maas, A.L., Hannun, A.Y., Ng, A.Y., et al.: Rectiﬁer nonlinearities improve neural network acoustic models. In: Proc. icml. vol. 30, p. 3. Atlanta, GA (2013)

work page 2013
[21]

Informatics in medicine unlocked 47, 101504 (2024)

Rayed, M.E., Islam, S.S., Niha, S.I., Jim, J.R., Kabir, M.M., Mridha, M.F.: Deep learning for medical image segmentation: State-of-the-art advancements and chal- lenges. Informatics in medicine unlocked 47, 101504 (2024)

work page 2024
[22]

In: International Conference on Medical image computing and computer-assisted intervention

Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedi- cal image segmentation. In: International Conference on Medical image computing and computer-assisted intervention. pp. 234–241. Springer (2015)

work page 2015
[23]

Neurocomputing 568, 127063 (2024)

Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y.: Roformer: Enhanced trans- former with rotary position embedding. Neurocomputing 568, 127063 (2024)

work page 2024
[24]

arXiv:2506.08297, to appear in Intern

Tran, N.T., Xue, F., Zhang, S., Lyu, J., Zheng, Y., Qi, Y., Xin, J.: SEMA: a scalable and eﬃcient mamba like attention via token localization and averaging. arXiv:2506.08297, to appear in Intern. Conf. Machine Learning (ICML) 2026

work page arXiv 2026
[25]

Instance Normalization: The Missing Ingredient for Fast Stylization

Ulyanov, D., Vedaldi, A., Lempitsky, V.: Instance normalization: The missing in- gredient for fast stylization. arXiv preprint arXiv:1607.08022 (2016)

work page Pith review arXiv 2016
[26]

Advances in neural information pro- cessing systems 30 (2017)

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. Advances in neural information pro- cessing systems 30 (2017)

work page 2017
[27]

Linformer: Self-Attention with Linear Complexity

Wang, S., Li, B.Z., Khabsa, M., Fang, H., Ma, H.: Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2006
[28]

arXiv preprint arXiv:2402.05079 (2024)

Wang, Z., Zheng, J.Q., Zhang, Y., Cui, G., Li, L.: Mamba-unet: Unet-like pure visual mamba for medical image segmentation. arXiv preprint arXiv:2402.05079 (2024)

work page arXiv 2024
[29]

Nature 630, 181–188 (2024)

Wu, H., et al: A whole-slide foundational model for digital pathology from real- world data. Nature 630, 181–188 (2024)

work page 2024
[30]

IEEE Transactions on Medical Imaging (2025) Title Suppressed Due to Excessive Length 11

Xing, Z., Ye, T., Yang, Y., Cai, D., Gai, B., Wu, X.J., Gao, F., Zhu, L.: Segmamba- v2: Long-range sequential modeling mamba for general 3d medical image segmen- tation. IEEE Transactions on Medical Imaging (2025) Title Suppressed Due to Excessive Length 11

work page 2025
[31]

Medical Image Analysis p

Zhang, Z., Ma, Q., Zhang, T., Chen, J., Zheng, H., Gao, W.: Switch-umamba: Dynamic scanning vision mamba unet for medical image segmentation. Medical Image Analysis p. 103792 (2025)

work page 2025
[32]

In: Proceed- ings of the AAAI conference on artiﬁcial intelligence

Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., Zhang, W.: Informer: Beyond eﬃcient transformer for long sequence time-series forecasting. In: Proceed- ings of the AAAI conference on artiﬁcial intelligence. vol. 35, pp. 11106–11115 (2021)

work page 2021
[33]

IEEE transactions on image processing 32, 4036–4045 (2023)

Zhou, H.Y., Guo, J., Zhang, Y., Han, X., Yu, L., Wang, L., Yu, Y.: nnformer: Volumetric medical image segmentation via a 3d transformer. IEEE transactions on image processing 32, 4036–4045 (2023)

work page 2023