
arxiv: 2505.14062 · v4 · submitted 2025-05-20 · 💻 cs.CV

FractalMamba++: Scaling Vision Mamba Across Resolutions via Hilbert Fractal Geometry

Pith reviewed 2026-05-22 13:54 UTC · model grok-4.3

classification 💻 cs.CV
keywords hilbert curve · vision mamba · fractal serialization · resolution scaling · state space model · position encoding · high-resolution vision · image segmentation

The pith

Hilbert fractal curves let Vision Mamba models keep spatial continuity when input resolution changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that turning a 2D image grid into a 1D sequence for state-space models loses neighborhood information, and this loss grows worse at resolutions different from training. It proposes that the Hilbert curve's recursive self-similar path can dictate how patches are ordered, where shortcuts are added to the state, and how positions are encoded so that local 2D relations stay consistent across scales. If this holds, Mamba-based vision models could process high-resolution inputs for classification, detection, and segmentation without the usual drop in accuracy that comes from mismatched sequence statistics.

Core claim

The central claim is that a single geometric principle—the recursive structure of the Hilbert curve—determines patch serialization, derives deterministic state-injection routes, and augments position encoding so that feature interactions reflect actual spatial proximity rather than 1D order, enabling Vision Mamba to scale across resolutions while preserving local neighborhoods.

What carries the argument

The Hilbert curve, a space-filling path whose recursive subdivisions keep nearby 2D patches close in the 1D sequence, applied here to create fractal serialization, hierarchy skip connections, and fractal-aware rotary position encoding.
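To make the serialization idea concrete, here is a minimal sketch of Hilbert-order patch serialization. It uses the textbook index-to-coordinate conversion for a square grid whose side is a power of two. The function names (`hilbert_d2xy`, `hilbert_order`, `serialize`) are illustrative and are not taken from the paper, whose actual implementation may differ.

```python
def hilbert_d2xy(n, d):
    """Map Hilbert index d to (x, y) on an n x n grid (n must be a power of two)."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the current sub-square so the pieces join up.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_order(n):
    """Return the (x, y) cells of an n x n grid in Hilbert-curve order."""
    return [hilbert_d2xy(n, d) for d in range(n * n)]


def serialize(grid):
    """Flatten a square 2D grid (list of rows) into a 1D list along the curve."""
    n = len(grid)
    return [grid[y][x] for x, y in hilbert_order(n)]


if __name__ == "__main__":
    patches = [[r * 4 + c for c in range(4)] for r in range(4)]
    print(serialize(patches))
```

Every pair of consecutive cells in this ordering is a direct 2D neighbor, which is the locality property the paper builds on. A raster scan breaks that property at the end of every row.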

If this is right

  • Performance improves over prior Mamba vision models on ImageNet-1K classification, with larger gains at high resolutions.
  • Detection and instance segmentation accuracy rises on COCO when inputs exceed training resolution.
  • Semantic segmentation on ADE20K and change detection on LEVIR-CD+ benefit similarly from the resolution-consistent ordering.
  • The skip connections and position encoding require no learned search or specialized kernels because they follow directly from the curve's recursion levels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same curve-based ordering could be tested in other linear-time sequence models to see whether locality preservation helps beyond Mamba.
  • Extending the recursion to three dimensions might allow similar scaling for video or volumetric data without retraining per resolution.
  • If neighborhood consistency is the key mechanism, replacing the Hilbert curve with other locality-preserving space-filling curves could be compared on the same tasks.

Load-bearing premise

Hilbert-curve serialization maintains consistent neighborhood statistics when the image grid size changes.

What would settle it

A direct measurement of average distance between originally adjacent patches after serialization at several resolutions, followed by an ablation showing that the claimed performance gains disappear when those distances vary sharply.
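The measurement half of this test is cheap to prototype. The sketch below is one possible reading of it, not the paper's protocol: for every pair of 4-adjacent patches it records how far apart the two patches land in the 1D sequence, under raster and Hilbert ordering, at several power-of-two grid sizes. All function names are hypothetical.

```python
from statistics import mean, median


def hilbert_d2xy(n, d):
    """Map Hilbert index d to (x, y) on an n x n grid (n must be a power of two)."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def raster_positions(n):
    """Sequence position of each cell under a row-major scan."""
    return {(x, y): y * n + x for y in range(n) for x in range(n)}


def hilbert_positions(n):
    """Sequence position of each cell under Hilbert ordering."""
    return {hilbert_d2xy(n, d): d for d in range(n * n)}


def adjacent_gaps(pos, n):
    """1D index gap for every pair of horizontally or vertically adjacent cells."""
    gaps = []
    for y in range(n):
        for x in range(n):
            if x + 1 < n:
                gaps.append(abs(pos[(x, y)] - pos[(x + 1, y)]))
            if y + 1 < n:
                gaps.append(abs(pos[(x, y)] - pos[(x, y + 1)]))
    return gaps


if __name__ == "__main__":
    for n in (8, 16, 32, 64):
        for name, build in (("raster", raster_positions), ("hilbert", hilbert_positions)):
            gaps = adjacent_gaps(build(n), n)
            print(n, name, "mean", mean(gaps), "median", median(gaps), "max", max(gaps))
```

Reporting the median and maximum alongside the mean matters here, because a few neighbor pairs that straddle a top-level quadrant boundary can dominate the mean under either ordering. The ablation half of the test needs trained models and is outside what a sketch like this can show.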

Figures

Figures reproduced from arXiv: 2505.14062 by Bo Li, Haoke Xiao, Lv Tang.

Figure 1: Visualization of correlation between the final state and state tokens
Figure 2: Top-1 classification accuracy of Mamba-based models across different input resolutions on ImageNet-1K. Results are grouped by parameter scale: ...
Figure 3: Architecture of the FractalMamba++ backbone. The design contains three Hilbert-geometry-driven components: Fractal-Aware 2D Rotary Position ...
Original abstract

Vision Mamba offers linear complexity for long visual sequences, yet its performance depends critically on how a two-dimensional patch grid is serialized into a one-dimensional state-space recurrence. Raster-style scans disrupt spatial continuity, and the mismatch between 2D locality and 1D state propagation becomes increasingly severe when the inference resolution grows beyond the training grid. This paper presents FractalMamba++, a resolution-scalable vision backbone organized around a single geometric principle: the recursive self-similar structure of the Hilbert curve determines how patches are serialized, where long-range state shortcuts are inserted, and how positional relations are encoded. First, Hilbert-curve-based Fractal Serialization preserves local 2D neighborhoods more faithfully than linear scans and provides consistent neighborhood statistics across resolutions. Second, the Fractal Hierarchy Skip Connection (FHSC) derives a compact set of deterministic state-injection routes from Hilbert recursion levels, mitigating long-sequence information fading without runtime search, hand-derived gradients, or dedicated CUDA kernels. Third, Fractal-Aware 2D Rotary Position Encoding (FA-RoPE) combines normalized 2D coordinates with a fractal hierarchy level so that feature interactions depend on actual spatial proximity and recursive structural role rather than serialized 1D distance. Extensive experiments on ImageNet-1K classification, COCO detection and instance segmentation, ADE20K semantic segmentation, and LEVIR-CD+ remote sensing change detection show that FractalMamba++ improves over existing Mamba-based vision backbones, especially under high-resolution inputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces FractalMamba++, a resolution-scalable vision backbone for Mamba-based models that organizes serialization, state shortcuts, and positional encoding around the recursive self-similar structure of the Hilbert curve. It proposes three components—Hilbert-curve-based Fractal Serialization to preserve 2D neighborhoods with consistent statistics across resolutions, Fractal Hierarchy Skip Connection (FHSC) for deterministic state-injection routes derived from recursion levels, and Fractal-Aware 2D Rotary Position Encoding (FA-RoPE) that incorporates normalized 2D coordinates and fractal hierarchy levels—and reports empirical gains over prior Mamba vision backbones on ImageNet-1K classification, COCO detection/instance segmentation, ADE20K semantic segmentation, and LEVIR-CD+ change detection, with particular emphasis on high-resolution inputs.

Significance. If the neighborhood-consistency property of Hilbert serialization is quantitatively validated and the reported gains prove robust, the work would supply a deterministic, parameter-light geometric mechanism for scaling state-space vision models to arbitrary resolutions without retraining or custom kernels, addressing a recognized limitation in current Vision Mamba designs and offering a reproducible template for other long-sequence visual tasks.

major comments (2)
  1. [Abstract] The central claim that Hilbert-curve serialization 'preserves local 2D neighborhoods more faithfully than linear scans and provides consistent neighborhood statistics across resolutions' is asserted as the geometric foundation for both serialization and position encoding, yet no locality metric (e.g., average 2D Euclidean distance of k-nearest serialized neighbors), ablation, or comparison across grid sizes is supplied; this assumption is load-bearing for the high-resolution gains claimed on COCO, ADE20K, and LEVIR-CD+.
  2. [Experiments] The manuscript reports improvements across four benchmarks but supplies no quantitative details on ablation controls, error bars, or the precise protocol used for training-to-inference resolution scaling; without these, the attribution of gains specifically to the three new components cannot be rigorously assessed.
minor comments (1)
  1. [Abstract] The acronyms FHSC and FA-RoPE are introduced without a one-sentence parenthetical gloss, which would aid readers unfamiliar with the method.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects that can strengthen the presentation of our geometric approach and experimental validation. We address each major comment point by point below and indicate the revisions planned for the next version.

Point-by-point responses
  1. Referee: [Abstract] The central claim that Hilbert-curve serialization 'preserves local 2D neighborhoods more faithfully than linear scans and provides consistent neighborhood statistics across resolutions' is asserted as the geometric foundation for both serialization and position encoding, yet no locality metric (e.g., average 2D Euclidean distance of k-nearest serialized neighbors), ablation, or comparison across grid sizes is supplied; this assumption is load-bearing for the high-resolution gains claimed on COCO, ADE20K, and LEVIR-CD+.

    Authors: We agree that a direct quantitative locality metric would provide stronger, more explicit support for the geometric foundation. In the revised manuscript we will add a dedicated analysis (new subsection or appendix figure) that reports the average 2D Euclidean distance of the k-nearest serialized neighbors (for k=4,8) under Hilbert versus raster serialization, computed on grids of varying sizes (14×14, 28×28, 56×56). We will also include a short ablation that isolates the contribution of this locality property to high-resolution downstream performance. These additions will make the load-bearing assumption directly verifiable. revision: yes

  2. Referee: [Experiments] The manuscript reports improvements across four benchmarks but supplies no quantitative details on ablation controls, error bars, or the precise protocol used for training-to-inference resolution scaling; without these, the attribution of gains specifically to the three new components cannot be rigorously assessed.

    Authors: We concur that additional experimental rigor is required. In the revision we will expand the Experiments section to include: (i) full ablation tables quantifying the incremental contribution of each component (Fractal Serialization, FHSC, FA-RoPE) on all four benchmarks; (ii) mean and standard deviation over at least three independent runs with different random seeds; and (iii) an explicit protocol subsection describing the training resolution (224²), the exact higher inference resolutions tested, and the deterministic interpolation/padding procedure used for resolution scaling without retraining. These details will allow readers to assess attribution of the reported gains. revision: yes
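The locality metric promised in response 1 can be sketched as follows. Two details are assumptions made here, since the text does not specify them: the "k-nearest serialized neighbors" of a token are taken to be the k/2 tokens on each side of it in the sequence, and the non-power-of-two grids (14×14, 28×28, 56×56) are handled by walking the Hilbert curve of the enclosing power-of-two square and skipping cells outside the grid. Function names are hypothetical.

```python
import math


def hilbert_d2xy(n, d):
    """Map Hilbert index d to (x, y) on an n x n grid (n must be a power of two)."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_sequence(h, w):
    """Cells of an h x w grid in Hilbert order (assumption: crop the enclosing square)."""
    n = 1
    while n < max(h, w):
        n *= 2
    seq = []
    for d in range(n * n):
        x, y = hilbert_d2xy(n, d)
        if x < w and y < h:
            seq.append((x, y))
    return seq


def raster_sequence(h, w):
    """Cells of an h x w grid in row-major order."""
    return [(x, y) for y in range(h) for x in range(w)]


def mean_knn_distance(seq, k):
    """Mean 2D Euclidean distance from each token to its k nearest sequence neighbors."""
    half = k // 2
    total, count = 0.0, 0
    for i, cell in enumerate(seq):
        for j in range(max(0, i - half), min(len(seq), i + half + 1)):
            if j != i:
                total += math.dist(cell, seq[j])
                count += 1
    return total / count


if __name__ == "__main__":
    for size in (14, 28, 56):
        for k in (4, 8):
            hil = mean_knn_distance(hilbert_sequence(size, size), k)
            ras = mean_knn_distance(raster_sequence(size, size), k)
            print(f"{size}x{size} k={k}: hilbert={hil:.3f} raster={ras:.3f}")
```

One caveat of the cropping choice: on grids that are not powers of two, the curve can leave the grid and re-enter elsewhere, so consecutive tokens are no longer guaranteed to be 2D neighbors. Whether the paper's procedure avoids this is exactly what the requested protocol subsection would need to state.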

Circularity Check

0 steps flagged

No circularity: design choices are independent geometric constructions with external experimental validation

Full rationale

The paper's core claims rest on three explicitly introduced components (Fractal Serialization, FHSC, FA-RoPE) whose definitions are derived directly from the known recursive properties of the Hilbert curve rather than from any fitted parameter or self-referential equation. The abstract states the neighborhood-consistency property as a geometric fact about Hilbert curves and then reports downstream empirical gains on ImageNet, COCO, ADE20K and LEVIR-CD+; none of these gains are shown to be algebraically forced by the same quantities used to define the components. No self-citations, uniqueness theorems, or ansatzes from prior author work appear in the provided text, and no prediction is obtained by fitting a subset of the target metric. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the untested premise that Hilbert recursion supplies consistent 2D neighborhood statistics at every scale and that the derived skip routes and position encoding inherit that property without additional fitting.

axioms (1)
  • domain assumption: Hilbert-curve-based serialization preserves local 2D neighborhoods more faithfully than linear scans and provides consistent neighborhood statistics across resolutions.
    Invoked in the abstract as the single geometric principle organizing serialization, skip connections, and position encoding.
invented entities (2)
  • Fractal Hierarchy Skip Connection (FHSC) · no independent evidence
    purpose: Derives deterministic state-injection routes from Hilbert recursion levels to mitigate long-sequence information fading.
    Newly introduced component whose routes are claimed to be free of runtime search or hand-derived gradients.
  • Fractal-Aware 2D Rotary Position Encoding (FA-RoPE) · no independent evidence
    purpose: Combines normalized 2D coordinates with fractal hierarchy level so feature interactions depend on spatial proximity and recursive structural role.
    Newly introduced encoding whose dependence on fractal level is presented as the key to resolution consistency.
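The text above gives no formula for FA-RoPE, so the following is only a sketch of how a rotary encoding over normalized 2D coordinates plus a level index could be wired up. Everything specific is an assumption: the equal three-way split of channel pairs across the x, y, and level axes, the `scale` and `base` constants, and the function names. The hierarchy level is passed in as a plain number because its definition is not stated here.

```python
import math


def rotate_pairs(vec, angles):
    """Rotate consecutive channel pairs (vec[2i], vec[2i+1]) by angles[i]."""
    out = []
    for i, theta in enumerate(angles):
        a, b = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out += [a * c - b * s, a * s + b * c]
    return out


def fa_rope_sketch(vec, u, v, level, scale=64.0, base=10000.0):
    """Hypothetical rotary encoding; len(vec) must be divisible by 6.

    u, v  : patch coordinates normalized to [0, 1], e.g. x / grid_width
    level : fractal hierarchy level of the patch (definition assumed external)
    """
    pairs = len(vec) // 2
    per_axis = pairs // 3
    freqs = [base ** (-i / per_axis) for i in range(per_axis)]
    angles = (
        [u * scale * f for f in freqs]      # first third: horizontal position
        + [v * scale * f for f in freqs]    # second third: vertical position
        + [level * f for f in freqs]        # last third: hierarchy level
    )
    return rotate_pairs(vec, angles)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


if __name__ == "__main__":
    q = [0.1 * i for i in range(12)]
    k = [1.0 - 0.05 * i for i in range(12)]
    # Same relative offset in (u, v, level), different absolute positions.
    print(dot(fa_rope_sketch(q, 0.25, 0.5, 2), fa_rope_sketch(k, 0.75, 0.25, 1)))
    print(dot(fa_rope_sketch(q, 0.35, 0.6, 3), fa_rope_sketch(k, 0.85, 0.35, 2)))
```

Because the coordinates are normalized, a patch at the same relative image location receives the same rotation at any grid size, and the query-key dot product depends only on the offsets in u, v, and level. That is the standard rotary property, and it is the plausible route by which such an encoding could stay consistent across resolutions.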


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    the recursive self-similar structure of the Hilbert curve determines how patches are serialized, where long-range state shortcuts are inserted, and how positional relations are encoded... provides consistent neighborhood statistics across resolutions

  • IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_strictMono_of_one_lt echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    FHSC selects one representative pair of spatially adjacent but sequentially distant sibling segments at each recursion level... E = union over l=1 to L of {(mid(S(1)_l), mid(S(4)_l))}
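The FHSC edge set quoted in the passage above is flattened into plain text. Typeset, and reading S(1)_l and S(4)_l as the first and fourth sibling segments at recursion level l (an interpretation of the notation, not confirmed by the excerpt), it would read:

```latex
E \;=\; \bigcup_{l=1}^{L} \Bigl\{ \bigl( \operatorname{mid}\bigl(S^{(1)}_{l}\bigr),\ \operatorname{mid}\bigl(S^{(4)}_{l}\bigr) \bigr) \Bigr\}
```

Under this reading the set holds one shortcut per recursion level, so |E| = L. If L is the recursion depth of a 2^L × 2^L grid with N = 4^L patches, the number of shortcuts grows as log_4 N, which fits the abstract's description of a "compact set" of routes.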

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
