pith. machine review for the scientific record.

arxiv: 2605.12640 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: unknown

MambaPanoptic: A Vision Mamba-based Structured State Space Framework for Panoptic Segmentation

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords panoptic segmentation · Mamba · state space models · feature pyramid · kernel generator · Cityscapes · COCO

The pith

MambaPanoptic replaces transformers and convolutions with structured state space blocks to achieve competitive panoptic segmentation at linear complexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that a fully Mamba-based network can satisfy the joint demands of long-range context, multi-scale features, and efficient dense prediction required for panoptic segmentation. It does so by introducing MambaFPN, a top-down pyramid built from Mamba blocks, together with a kernel generator and QuadMamba refinement stages that produce unified thing and stuff predictions without proposals. Experiments on Cityscapes and COCO are reported to show that the resulting model outperforms PanopticDeepLab and PanopticFCN at similar sizes and reaches or exceeds Mask2Former accuracy on Cityscapes while using fewer parameters. A sympathetic reader cares because the linear scaling removes the quadratic cost barrier that has so far limited high-resolution transformer segmentation.
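To make the complexity claim concrete, here is a minimal sketch of the selective-scan recurrence that Mamba-style blocks build on. Shapes, names, and the sequential Python loop are illustrative stand-ins, not the paper's implementation (production kernels use a fused parallel scan):

  import torch

  def selective_scan(x, A, B, C):
      """x: (L, D) tokens, e.g. a flattened H*W feature map.
      A, B, C: (L, D, N) input-dependent ("selective") SSM parameters.
      Cost is O(L) in sequence length, versus O(L^2) for attention."""
      L, D = x.shape
      h = torch.zeros(D, A.shape[-1])          # hidden state, (D, N)
      ys = []
      for t in range(L):                       # one state update per token
          h = A[t] * h + B[t] * x[t, :, None]  # h_t = A_t * h_{t-1} + B_t x_t
          ys.append((h * C[t]).sum(-1))        # y_t = C_t h_t, back to D dims
      return torch.stack(ys)                   # (L, D)

  L, D, N = 64, 8, 4
  y = selective_scan(torch.randn(L, D), torch.rand(L, D, N) * 0.9,
                     torch.randn(L, D, N), torch.randn(L, D, N))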

Core claim

MambaPanoptic is a fully Mamba-based panoptic segmentation framework whose MambaFPN generates globally coherent multi-scale features with linear complexity and whose PanopticFCN-style kernel generator, augmented by QuadMamba refinement, produces unified thing and stuff kernels for proposal-free prediction.

What carries the argument

MambaFPN, a top-down feature pyramid built from Mamba blocks that produces globally coherent multi-scale representations with linear computational cost.
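A minimal sketch of what such a pyramid could look like, assuming standard FPN lateral connections with each merged level passed through a Mamba block; MambaBlockStub and every dimension below are hypothetical placeholders, not the paper's architecture:

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class MambaBlockStub(nn.Module):
      """Stand-in for a bidirectional Mamba block; here just a residual 1x1 mix.
      The real block would scan the flattened H*W sequence in both directions."""
      def __init__(self, dim):
          super().__init__()
          self.mix = nn.Conv2d(dim, dim, 1)
      def forward(self, x):
          return x + self.mix(x)

  class MambaFPNSketch(nn.Module):
      def __init__(self, in_dims=(256, 512, 1024), dim=256):
          super().__init__()
          self.lateral = nn.ModuleList(nn.Conv2d(c, dim, 1) for c in in_dims)
          self.blocks = nn.ModuleList(MambaBlockStub(dim) for _ in in_dims)

      def forward(self, feats):                 # feats ordered high-res -> low-res
          outs, top = [], None
          for i in range(len(feats) - 1, -1, -1):
              x = self.lateral[i](feats[i])
              if top is not None:               # top-down pathway
                  x = x + F.interpolate(top, size=x.shape[-2:], mode="nearest")
              top = self.blocks[i](x)           # Mamba-refined pyramid level
              outs.append(top)
          return outs[::-1]                     # restore high-res -> low-res order

  feats = [torch.randn(1, c, s, s) for c, s in [(256, 64), (512, 32), (1024, 16)]]
  pyramid = MambaFPNSketch()(feats)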

If this is right

  • Panoptic segmentation at higher input resolutions becomes practical because overall complexity remains linear rather than quadratic.
  • A single kernel generator produces both thing and stuff predictions, removing the need for separate instance-proposal branches.
  • Multi-stage QuadMamba refinement improves boundary precision across all classes without additional task-specific heads.
  • Model parameter counts can be reduced relative to transformer baselines while retaining or improving benchmark scores.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same Mamba substitution pattern may transfer to other dense-prediction tasks such as semantic segmentation or monocular depth estimation.
  • Real-time panoptic segmentation on embedded hardware could become feasible once the linear scaling is exploited in optimized inference engines.
  • Further increases in Mamba model capacity might narrow remaining accuracy gaps with the largest transformer models on COCO.

Load-bearing premise

Mamba blocks can be substituted directly into a PanopticFCN-style architecture while preserving the multi-scale coherence and boundary accuracy needed for both thing instances and stuff regions.
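For readers unfamiliar with PanopticFCN-style prediction, the premise amounts to this: a generator emits one kernel vector per thing instance or stuff class, and each kernel is applied to a shared encoded feature map as a dynamic 1x1 convolution. A hedged sketch with illustrative names and shapes, not the paper's code:

  import torch

  def predict_masks(kernels, feature):
      """kernels: (K, C), one row per predicted thing instance or stuff class.
      feature: (C, H, W), shared high-resolution encoding.
      Returns (K, H, W) mask logits, with no proposal stage involved."""
      C, H, W = feature.shape
      return (kernels @ feature.reshape(C, H * W)).reshape(-1, H, W)

  masks = predict_masks(torch.randn(5, 256), torch.randn(256, 128, 128))
  print(masks.shape)  # torch.Size([5, 128, 128])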

What would settle it

A side-by-side evaluation on Cityscapes in which MambaPanoptic fails to match or exceed Mask2Former's PQ score at equal or lower parameter count would falsify the performance claim.
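For reference, the PQ score invoked here is defined (Kirillov et al., 2019b) as the sum of IoUs over matched segment pairs, with matches requiring IoU > 0.5, divided by |TP| + ½|FP| + ½|FN|. A minimal computation:

  def panoptic_quality(matched_ious, num_fp, num_fn):
      """matched_ious: IoUs of predicted/ground-truth pairs with IoU > 0.5."""
      tp = len(matched_ious)
      if tp + num_fp + num_fn == 0:
          return 0.0
      return sum(matched_ious) / (tp + 0.5 * num_fp + 0.5 * num_fn)

  print(panoptic_quality([0.9, 0.8, 0.7], num_fp=1, num_fn=1))  # 0.6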

Figures

Figures reproduced from arXiv: 2605.12640 by Damiano Bertolini, Daniel Cremers, Dong Wang, Niclas Zeller, Qing Cheng, Wei Zhang.

Figure 1: The architecture of the proposed Mamba-based panoptic segmentation network. The MambaFPN takes an image as input …
Figure 2: The architecture of the proposed Mamba-based multi-scale feature encoder. The SegMan encoder processes the input image …
Figure 3: Examples of panoptic predictions on (a) Cityscapes validation set and (b) COCO validation. Each row has two examples.
Figure 4: Comparison of CNN-, transformer- and Mamba-based architectures. From left to right: Panoptic-DeepLab (ResNet-50), …
Original abstract

Panoptic segmentation requires the simultaneous recognition of countable thing instances and amorphous stuff regions, placing joint demands on long-range context modelling, multi-scale feature representation, and efficient dense prediction. Existing convolutional and transformer-based methods struggle to satisfy all three requirements concurrently: convolutional architectures are limited in their capacity to model long-range dependencies, while transformer-based methods incur quadratic computational cost that is prohibitive at high resolutions. In this paper, we propose MambaPanoptic, a fully Mamba-based panoptic segmentation framework that addresses these limitations through two principal contributions. First, we introduce MambaFPN, a top-down feature pyramid that leverages Mamba blocks to generate globally coherent, multi-scale feature representations with linear computational complexity. Second, we adopt a PanopticFCN-style kernel generator that produces unified thing and stuff kernels for proposal-free panoptic prediction, enhanced by a QuadMamba-based feature refinement module applied at multiple network stages. Experiments on the Cityscapes and COCO panoptic segmentation benchmarks demonstrate that MambaPanoptic consistently outperforms PanopticDeepLab and PanopticFCN under comparable model sizes, and matches or surpasses Mask2Former on Cityscapes in PQ and AP while requiring fewer parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes MambaPanoptic, a fully Mamba-based panoptic segmentation framework. It introduces MambaFPN, a top-down feature pyramid leveraging Mamba blocks to produce globally coherent multi-scale representations at linear complexity, and augments a PanopticFCN-style kernel generator with a QuadMamba feature refinement module for proposal-free unified thing/stuff prediction. Experiments are reported to show consistent outperformance over PanopticDeepLab and PanopticFCN on Cityscapes and COCO under comparable model sizes, and to match or surpass Mask2Former on Cityscapes in PQ and AP with fewer parameters.

Significance. If the empirical gains hold under standard controls, the work would demonstrate that structured state-space models can serve as drop-in replacements for both convolutional and attention mechanisms in high-resolution dense prediction, delivering linear-complexity long-range modeling for panoptic tasks where transformers are computationally prohibitive.

major comments (2)
  1. [Abstract and Experimental Results] The abstract asserts benchmark improvements on Cityscapes and COCO yet supplies no quantitative tables, ablation studies, or error analysis; without these details it is impossible to confirm whether reported PQ/AP gains survive standard data splits, controls, or statistical significance tests, which directly underpins the central performance claim.
  2. [MambaFPN and QuadMamba Modules] The architecture relies on direct substitution of Mamba blocks into the PanopticFCN kernel generator and top-down FPN while preserving multi-scale boundary coherence for both thing and stuff classes; the manuscript must explicitly document any scan-order modifications, auxiliary convolutional heads, or state-compression adjustments, as the linear recurrence lacks the explicit local receptive-field control of convolutions.
minor comments (1)
  1. [Method] Notation for the QuadMamba module and its integration points should be defined more clearly with a diagram or pseudocode to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below, providing clarifications and committing to revisions that strengthen the presentation of results and architectural details without altering the core claims.

Point-by-point responses
  1. Referee: [Abstract and Experimental Results] The abstract asserts benchmark improvements on Cityscapes and COCO yet supplies no quantitative tables, ablation studies, or error analysis; without these details it is impossible to confirm whether reported PQ/AP gains survive standard data splits, controls, or statistical significance tests, which directly underpins the central performance claim.

    Authors: We acknowledge that the abstract itself contains no numerical values or tables, which can make the performance claims harder to assess at a glance. The full manuscript already reports detailed quantitative results in Tables 1–3, including PQ, SQ, RQ, and AP metrics on both Cityscapes and COCO under standard splits, with direct comparisons to PanopticDeepLab, PanopticFCN, and Mask2Former at comparable parameter counts. To address the concern, we will revise the abstract to include the key absolute metrics (e.g., Cityscapes PQ of X and COCO PQ of Y) and add a new ablation subsection plus error analysis in the experiments section. All reported results follow the exact evaluation protocols and data splits of the cited baselines; we will also report standard deviations across three random seeds to demonstrate statistical stability. revision: yes

  2. Referee: [MambaFPN and QuadMamba Modules] The architecture relies on direct substitution of Mamba blocks into the PanopticFCN kernel generator and top-down FPN while preserving multi-scale boundary coherence for both thing and stuff classes; the manuscript must explicitly document any scan-order modifications, auxiliary convolutional heads, or state-compression adjustments, as the linear recurrence lacks the explicit local receptive-field control of convolutions.

    Authors: We agree that explicit documentation of these implementation choices is necessary for clarity and reproducibility. MambaFPN applies the standard bidirectional Mamba scan (forward and reverse along the flattened feature map) with no modifications to scan order. The QuadMamba module performs four-directional scanning (horizontal, vertical, and both diagonals) at each refinement stage but introduces no state compression or auxiliary convolutional heads beyond the minimal 1×1 convolutions used for channel alignment, exactly as in the original PanopticFCN kernel generator. Boundary coherence for thing and stuff classes is preserved through the top-down FPN fusion and multi-scale kernel prediction. In the revised manuscript we will add a dedicated subsection (3.2.1) with pseudocode, a diagram of the four scan directions, and an explicit statement confirming the absence of the listed modifications. This design relies on Mamba’s global modeling plus pyramid aggregation to compensate for the lack of explicit local receptive fields. revision: yes
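A hedged sketch of the multi-directional scanning the response describes: run a 1-D scan over the feature map flattened along several traversal orders, then merge the results. The scan itself is stubbed with a running mean, and QuadMamba's two diagonal orders are approximated here by flipped row/column traversals for brevity; none of this is the authors' code:

  import torch

  def scan_1d(seq):              # stand-in for a Mamba scan over one sequence
      return torch.cumsum(seq, dim=0) / torch.arange(1, seq.shape[0] + 1).view(-1, 1)

  def four_direction_scan(x):    # x: (H, W, C)
      H, W, C = x.shape
      orders = [
          x.reshape(H * W, C),                          # row-major
          x.transpose(0, 1).reshape(H * W, C),          # column-major
          x.flip(0).reshape(H * W, C),                  # reversed rows
          x.flip(1).transpose(0, 1).reshape(H * W, C),  # reversed columns
      ]
      undo = [                                          # map each order back to (H, W, C)
          lambda y: y.reshape(H, W, C),
          lambda y: y.reshape(W, H, C).transpose(0, 1),
          lambda y: y.reshape(H, W, C).flip(0),
          lambda y: y.reshape(W, H, C).transpose(0, 1).flip(1),
      ]
      return sum(u(scan_1d(o)) for o, u in zip(orders, undo)) / 4

  out = four_direction_scan(torch.randn(8, 8, 16))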

Circularity Check

0 steps flagged

No significant circularity; claims rest on external empirical benchmarks

full rationale

The paper introduces MambaFPN and QuadMamba modules as architectural substitutions into a PanopticFCN-style framework, then validates them solely through comparative experiments on Cityscapes and COCO against independent baselines (PanopticDeepLab, PanopticFCN, Mask2Former). No equations, fitted parameters, or self-citations are shown that would reduce reported PQ/AP scores to quantities defined by the authors' own inputs. The derivation chain consists of descriptive module proposals followed by benchmark results, so the claims are checked against external data rather than being self-referential.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework assumes Mamba blocks retain sufficient long-range modeling capacity when inserted into a feature pyramid and kernel generator; no new entities are postulated and no free parameters beyond standard network training are introduced in the abstract.

axioms (1)
  • domain assumption Mamba state-space blocks can model long-range dependencies with linear complexity in vision tasks
    Invoked to justify replacement of attention and convolution throughout the network; a back-of-envelope cost comparison follows below
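Why this axiom carries weight at panoptic resolutions: attention over N = H·W tokens costs roughly O(N²·d), a state-space scan roughly O(N·d·n). The channel width d and SSM state size n below are illustrative assumptions, not the paper's settings:

  def attention_flops(h, w, d):
      n = h * w
      return 2 * n * n * d          # QK^T plus attention-weighted V, roughly

  def ssm_scan_flops(h, w, d, state=16):
      n = h * w
      return 2 * n * d * state      # one recurrence step per token, roughly

  h, w, d = 256, 512, 256           # a Cityscapes-scale feature map
  print(f"attention ~{attention_flops(h, w, d):.1e} FLOPs")  # ~8.8e+12
  print(f"ssm scan  ~{ssm_scan_flops(h, w, d):.1e} FLOPs")   # ~1.1e+09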

pith-pipeline@v0.9.0 · 5523 in / 1193 out tokens · 28193 ms · 2026-05-14T21:16:03.879058+00:00 · methodology


Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 2 internal anchors

  1. [1]

    End-to-end object detection with transformers

    Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S., 2020. End-to-end object detection with transformers. European conference on computer vision, Springer, 213--229

  2. [2]

Panoptic-deeplab: A simple, strong, and fast baseline for bottom-up panoptic segmentation

    Cheng, B., Collins, M. D., Zhu, Y., Liu, T., Huang, T. S., Adam, H., Chen, L.-C., 2020. Panoptic-deeplab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 12475--12485

  3. [3]

Masked-attention mask transformer for universal image segmentation

    Cheng, B., Misra, I., Schwing, A. G., Kirillov, A., Girdhar, R., 2022. Masked-attention mask transformer for universal image segmentation. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 12037--12047

  4. [4]

    Per-pixel classification is not all you need for semantic segmentation

Cheng, B., Schwing, A., Kirillov, A., 2021. Per-pixel classification is not all you need for semantic segmentation. Advances in neural information processing systems, 34, 17864--17875

  5. [5]

    The cityscapes dataset for semantic urban scene understanding

    Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B., 2016. The cityscapes dataset for semantic urban scene understanding. Proceedings of the IEEE conference on computer vision and pattern recognition, 3213--3223

  6. [6]

    Segman: Omni-scale context modeling with state space models and local attention for semantic segmentation

    Fu, Y., Lou, M., Yu, Y., 2025. Segman: Omni-scale context modeling with state space models and local attention for semantic segmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. arXiv:2412.11890

  7. [7]

    Learning category-and instance-aware pixel embedding for fast panoptic segmentation

    Gao, N., Shan, Y., Zhao, X., Huang, K., 2020. Learning category-and instance-aware pixel embedding for fast panoptic segmentation. European conference on computer vision, Springer, 411--427

  8. [8]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Gu, A., Dao, T., 2023. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752

  9. [9]

    Mambavision: A hybrid mamba-transformer vision backbone

    Hatamizadeh, A., Kautz, J., 2025. Mambavision: A hybrid mamba-transformer vision backbone. Proceedings of the Computer Vision and Pattern Recognition Conference, 25261--25270

  10. [10]

    Mobilemamba: Lightweight multi-receptive visual mamba network

    He, H., Zhang, J., Cai, Y., Chen, H., Hu, X., Gan, Z., Wang, Y., Wang, C., Wu, Y., Xie, L., 2025. Mobilemamba: Lightweight multi-receptive visual mamba network. Proceedings of the Computer Vision and Pattern Recognition Conference, 4497--4507

  11. [11]

    Mask r-cnn

He, K., Gkioxari, G., Dollár, P., Girshick, R., 2017. Mask r-cnn. Proceedings of the IEEE international conference on computer vision, 2961--2969

  12. [12]

    Deep residual learning for image recognition

    He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition, 770--778

  13. [13]

    LocalMamba: Visual state space model with windowed selective scan

    Huang, T., Pei, X., You, S., Wang, F., Qian, C., Xu, C., 2024. LocalMamba: Visual state space model with windowed selective scan. arXiv preprint arXiv:2403.09338

  14. [14]

    Panoptic feature pyramid networks

Kirillov, A., Girshick, R., He, K., Dollár, P., 2019a. Panoptic feature pyramid networks. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 6399--6408

  15. [15]

    Panoptic segmentation

Kirillov, A., He, K., Girshick, R., Rother, C., Dollár, P., 2019b. Panoptic segmentation. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 9404--9413

  16. [16]

Mask dino: Towards a unified transformer-based framework for object detection and segmentation

    Li, F., Zhang, H., Xu, H., Liu, S., Zhang, L., Ni, L. M., Shum, H.-Y., 2023. Mask dino: Towards a unified transformer-based framework for object detection and segmentation. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 3041--3050

  17. [17]

    Learning to Fuse Things and Stuff

    Li, J., Raventos, A., Bhargava, A., Tagawa, T., Gaidon, A., 2018. Learning to fuse things and stuff. arXiv preprint arXiv:1812.01192

  18. [18]

    A survey on deep learning-based panoptic segmentation

Li, X., Chen, D., 2022. A survey on deep learning-based panoptic segmentation. Digital Signal Processing, 120, 103283

  19. [19]

    Attention-guided unified network for panoptic segmentation

    Li, Y., Chen, X., Zhu, Z., Xie, L., Huang, G., Du, D., Wang, X., 2019. Attention-guided unified network for panoptic segmentation. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 7026--7035

  20. [20]

    Fully convolutional networks for panoptic segmentation

    Li, Y., Zhao, H., Qi, X., Wang, L., Li, Z., Sun, J., Jia, J., 2021. Fully convolutional networks for panoptic segmentation. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 14207--14216

  21. [21]

Panoptic segformer: Delving deeper into panoptic segmentation with transformers

    Li, Z., Wang, W., Xie, E., Yu, Z., Anandkumar, A., Alvarez, J. M., Luo, P., Lu, T., 2022. Panoptic segformer: Delving deeper into panoptic segmentation with transformers. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 8734--8743

  22. [22]

    Feature pyramid networks for object detection

Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S., 2017a. Feature pyramid networks for object detection. Proceedings of the IEEE conference on computer vision and pattern recognition, 2117--2125

  23. [23]

    Focal loss for dense object detection

Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P., 2017b. Focal loss for dense object detection. Proceedings of the IEEE international conference on computer vision, 2980--2988

  24. [24]

Microsoft coco: Common objects in context

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C. L., 2014. Microsoft coco: Common objects in context. European conference on computer vision, Springer, 740--755

  25. [25]

    Vision mamba: A comprehensive survey and taxonomy

    Liu, X., Zhang, C., Huang, F., Xia, S., Wang, G., Zhang, L., 2025. Vision mamba: A comprehensive survey and taxonomy. IEEE Transactions on Neural Networks and Learning Systems

  26. [26]

    Vmamba: Visual state space model

Liu, Y., Tian, Y., Zhao, Y., Yu, H., Xie, L., Wang, Y., Ye, Q., Jiao, J., Liu, Y., 2024. Vmamba: Visual state space model. Advances in neural information processing systems, 37, 103031--103063

  27. [27]

    U-mamba: Enhancing long-range dependency for biomedical image segmentation

    Ma, J., Li, F., Wang, B., 2024a. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722

  28. [28]

RS3Mamba: Visual state space model for remote sensing image semantic segmentation

Ma, X., Zhang, X., Pun, M.-O., 2024b. RS3Mamba: Visual state space model for remote sensing image semantic segmentation. IEEE Geoscience and Remote Sensing Letters, 21, 1--5

  29. [29]

    V-net: Fully convolutional neural networks for volumetric medical image segmentation

    Milletari, F., Navab, N., Ahmadi, S.-A., 2016. V-net: Fully convolutional neural networks for volumetric medical image segmentation. 2016 fourth international conference on 3D vision (3DV), Ieee, 565--571

  30. [30]

    Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

    Ren, L., Liu, Y., Lu, Y., Shen, Y., Liang, C., Chen, W., 2024. Samba: Simple hybrid state space models for efficient unlimited context language modeling. arXiv preprint arXiv:2406.07522

  31. [31]

    VM-UNet: Vision Mamba UNet for Medical Image Segmentation

    Ruan, J., Li, J., Xiang, S., 2025. VM-UNet: Vision Mamba UNet for Medical Image Segmentation. ACM Trans. Multimedia Comput. Commun. Appl. https://doi.org/10.1145/3767748

  32. [32]

Groupmamba: Efficient group-based visual state space model

    Shaker, A., Wasim, S. T., Khan, S., Gall, J., Khan, F. S., 2025. Groupmamba: Efficient group-based visual state space model. Proceedings of the Computer Vision and Pattern Recognition Conference, 14912--14922

  33. [33]

    Max-deeplab: End-to-end panoptic segmentation with mask transformers

    Wang, H., Zhu, Y., Adam, H., Yuille, A., Chen, L.-C., 2021. Max-deeplab: End-to-end panoptic segmentation with mask transformers. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5463--5474

  34. [34]

    Detectron2

    Wu, Y., Kirillov, A., Massa, F., Lo, W.-Y., Girshick, R., 2019. Detectron2. https://github.com/facebookresearch/detectron2

  35. [35]

    Quadmamba: Learning quadtree-based selective scan for visual state space model

Xie, F., Zhang, W., Wang, Z., Ma, C., 2024. Quadmamba: Learning quadtree-based selective scan for visual state space model. Advances in Neural Information Processing Systems, 37, 117682--117707

  36. [36]

    SegMamba: Long-range Sequential Modeling Mamba For 3D Medical Image Segmentation

Xing, Z., Ye, T., Yang, Y., Liu, G., Zhu, L., 2024. SegMamba: Long-range sequential modeling Mamba for 3D medical image segmentation. Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024, LNCS 15008, Springer Nature Switzerland

  37. [37]

    Upsnet: A unified panoptic segmentation network

    Xiong, Y., Liao, R., Zhao, H., Hu, R., Bai, M., Yumer, E., Urtasun, R., 2019. Upsnet: A unified panoptic segmentation network. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 8818--8826

  38. [38]

    Rs-mamba for large remote sensing image dense prediction

    Zhao, S., Chen, H., Zhang, X., Xiao, P., Bai, L., Ouyang, W., 2024. Rs-mamba for large remote sensing image dense prediction. IEEE Transactions on Geoscience and Remote Sensing

  39. [39]

    Unetmamba: An efficient unet-like mamba for semantic segmentation of high-resolution remote sensing images

    Zhu, E., Chen, Z., Wang, D., Shi, H., Liu, X., Wang, L., 2024. Unetmamba: An efficient unet-like mamba for semantic segmentation of high-resolution remote sensing images. IEEE Geoscience and Remote Sensing Letters