FractalMamba++: Scaling Vision Mamba Across Resolutions via Hilbert Fractal Geometry

Bo Li; Haoke Xiao; Lv Tang

arXiv: 2505.14062 · v4 · submitted 2025-05-20 · cs.CV

Pith reviewed 2026-05-22 13:54 UTC · model grok-4.3

classification: cs.CV

keywords: Hilbert curve · vision Mamba · fractal serialization · resolution scaling · state space model · position encoding · high-resolution vision · image segmentation

The pith

Hilbert fractal curves let Vision Mamba models keep spatial continuity when input resolution changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that turning a 2D image grid into a 1D sequence for state-space models loses neighborhood information, and that the loss worsens as inference resolution grows beyond the training resolution. It proposes that the Hilbert curve's recursive, self-similar path can dictate how patches are ordered, where shortcuts are added to the state, and how positions are encoded, so that local 2D relations stay consistent across scales. If this holds, Mamba-based vision models could process high-resolution inputs for classification, detection, and segmentation with less of the accuracy drop that comes from mismatched sequence statistics.

Core claim

The central claim is that a single geometric principle—the recursive structure of the Hilbert curve—determines patch serialization, derives deterministic state-injection routes, and augments position encoding so that feature interactions reflect actual spatial proximity rather than 1D order, enabling Vision Mamba to scale across resolutions while preserving local neighborhoods.

What carries the argument

The Hilbert curve is a space-filling path whose recursive subdivisions keep nearby 2D patches close in the 1D sequence. Here it is used to build fractal serialization, hierarchy skip connections, and fractal-aware rotary position encoding.
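To make the serialization idea concrete, here is a minimal Python sketch of a Hilbert-curve patch ordering. It uses the standard index-to-coordinate conversion for a square grid whose side is a power of two. The names `hilbert_d2xy`, `hilbert_path`, and `serialize` are illustrative and not taken from the paper, and the paper's own Fractal Serialization (including how it treats grids that are not powers of two) may differ.

```python
def hilbert_d2xy(order: int, d: int) -> tuple[int, int]:
    """Convert position d along the curve into (x, y) on a 2**order square grid."""
    x = y = 0
    s = 1
    while s < (1 << order):
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:  # rotate or flip the sub-square built so far
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y


def hilbert_path(order: int) -> list[tuple[int, int]]:
    """All grid cells in the order the curve visits them."""
    side = 1 << order
    return [hilbert_d2xy(order, d) for d in range(side * side)]


def serialize(grid: list[list], order: int) -> list:
    """Flatten grid[y][x] (side 2**order) into a 1D list in Hilbert order."""
    return [grid[y][x] for x, y in hilbert_path(order)]


# Hypothetical 4x4 patch grid whose entries are simply raster indices.
patches = [[4 * y + x for x in range(4)] for y in range(4)]
print(serialize(patches, order=2))
```

Every pair of consecutive positions in `hilbert_path` is a pair of edge-adjacent cells, which is the property the serialization argument relies on; a raster scan breaks it at the end of every row.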

If this is right

Performance improves over prior Mamba vision models on ImageNet-1K classification, with larger gains at high resolutions.
Detection and instance segmentation accuracy rises on COCO when inputs exceed training resolution.
Semantic segmentation on ADE20K and change detection on LEVIR-CD+ benefit similarly from the resolution-consistent ordering.
The skip connections and position encoding require no learned search or specialized kernels because they follow directly from the curve's recursion levels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same curve-based ordering could be tested in other linear-time sequence models to see whether locality preservation helps beyond Mamba.
Extending the recursion to three dimensions might allow similar scaling for video or volumetric data without retraining per resolution.
If neighborhood consistency is the key mechanism, replacing the Hilbert curve with other locality-preserving space-filling curves could be compared on the same tasks.

Load-bearing premise

Hilbert-curve serialization maintains consistent neighborhood statistics when the image grid size changes.

What would settle it

A direct measurement of the average sequence distance between patches that are adjacent in 2D, taken after serialization at several resolutions, followed by an ablation showing whether the claimed performance gains disappear when those distances vary sharply.
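The measurement described above can be prototyped in a few lines. The sketch below is an assumption-laden illustration, not the paper's protocol: it restricts itself to power-of-two grids, compares a Hilbert order with a plain raster order, and reports three statistics of the 1D gap between cells that are edge-adjacent in 2D. The helper names (`sequence_rank`, `adjacent_gaps`, `summarize`) and the choice of statistics are hypothetical.

```python
from statistics import mean


def hilbert_d2xy(order, d):
    """Standard Hilbert index -> (x, y) conversion on a 2**order square grid."""
    x = y = 0
    s = 1
    while s < (1 << order):
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y


def sequence_rank(order, scheme):
    """Map each cell (x, y) to its position in the 1D sequence."""
    side = 1 << order
    if scheme == "hilbert":
        return {hilbert_d2xy(order, d): d for d in range(side * side)}
    return {(x, y): y * side + x for y in range(side) for x in range(side)}  # raster


def adjacent_gaps(order, scheme):
    """1D gaps |rank(p) - rank(q)| over all edge-adjacent cell pairs."""
    side = 1 << order
    rank = sequence_rank(order, scheme)
    gaps = []
    for y in range(side):
        for x in range(side):
            if x + 1 < side:
                gaps.append(abs(rank[(x, y)] - rank[(x + 1, y)]))
            if y + 1 < side:
                gaps.append(abs(rank[(x, y)] - rank[(x, y + 1)]))
    return gaps


def summarize(order, scheme):
    gaps = adjacent_gaps(order, scheme)
    return {
        "mean": mean(gaps),
        "unit_fraction": sum(g == 1 for g in gaps) / len(gaps),  # pairs kept consecutive
        "max": max(gaps),
    }


for order in (2, 3, 4, 5):
    for scheme in ("hilbert", "raster"):
        print(1 << order, scheme, summarize(order, scheme))
```

On a 4×4 grid this sketch gives a mean gap of about 2.67 for the Hilbert order against 2.5 for raster, while 15 of 24 adjacent pairs stay consecutive under Hilbert against 12 of 24 under raster. The mean is dominated by a few long jumps across quadrant boundaries, so the choice of statistic matters when testing the premise.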

Figures

Figures reproduced from arXiv: 2505.14062 by Bo Li, Haoke Xiao, Lv Tang.

**Figure 2.** Top-1 classification accuracy of Mamba-based models across different input resolutions on ImageNet-1K. Results are grouped by parameter scale:

**Figure 3.** Architecture of the FractalMamba++ backbone. The design contains three Hilbert-geometry-driven components: Fractal-Aware 2D Rotary Position

Original abstract

Vision Mamba offers linear complexity for long visual sequences, yet its performance depends critically on how a two-dimensional patch grid is serialized into a one-dimensional state-space recurrence. Raster-style scans disrupt spatial continuity, and the mismatch between 2D locality and 1D state propagation becomes increasingly severe when the inference resolution grows beyond the training grid. This paper presents FractalMamba++, a resolution-scalable vision backbone organized around a single geometric principle: the recursive self-similar structure of the Hilbert curve determines how patches are serialized, where long-range state shortcuts are inserted, and how positional relations are encoded. First, Hilbert-curve-based Fractal Serialization preserves local 2D neighborhoods more faithfully than linear scans and provides consistent neighborhood statistics across resolutions. Second, the Fractal Hierarchy Skip Connection (FHSC) derives a compact set of deterministic state-injection routes from Hilbert recursion levels, mitigating long-sequence information fading without runtime search, hand-derived gradients, or dedicated CUDA kernels. Third, Fractal-Aware 2D Rotary Position Encoding (FA-RoPE) combines normalized 2D coordinates with a fractal hierarchy level so that feature interactions depend on actual spatial proximity and recursive structural role rather than serialized 1D distance. Extensive experiments on ImageNet-1K classification, COCO detection and instance segmentation, ADE20K semantic segmentation, and LEVIR-CD+ remote sensing change detection show that FractalMamba++ improves over existing Mamba-based vision backbones, especially under high-resolution inputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FractalMamba++ ties Hilbert curves to serialization, skip connections, and position encoding to help Vision Mamba handle higher resolutions, but the key locality claim lacks quantitative backing.

The main point is that this paper uses the recursive structure of the Hilbert curve to serialize 2D patches into 1D sequences for Vision Mamba, derive deterministic skip connections, and modify 2D RoPE. The goal is to keep neighborhood relations more stable when inference resolution exceeds the training grid, which is a documented weak spot for these linear-complexity models. The three pieces (Fractal Serialization, FHSC, and FA-RoPE) are presented as coming from the same geometric source, which gives the design a unified feel rather than a collection of unrelated tweaks.

Experiments on ImageNet-1K, COCO, ADE20K, and LEVIR-CD+ report gains that are larger at high resolutions, which matches the stated motivation. That coverage of tasks is reasonable for a backbone paper.

The soft spot is the missing support for the central assumption. The abstract asserts that Hilbert serialization yields consistent neighborhood statistics across resolutions, yet supplies no locality metric, ablation on grid sizes, or comparison of 2D distances in the 1D order. Without those numbers the high-resolution improvements on detection and segmentation rest on unexamined choices, and the stress-test concern about shifting statistics holds up on the given evidence.

The work is aimed at researchers extending Mamba-style models to high-resolution vision tasks. It shows clear engagement with the scaling limitation and offers a concrete alternative, so it deserves a serious referee even though more controls on the geometric claims would strengthen it. I would send it to review with a request for those ablations.

Referee Report

2 major / 1 minor

Summary. The paper introduces FractalMamba++, a resolution-scalable vision backbone for Mamba-based models that organizes serialization, state shortcuts, and positional encoding around the recursive self-similar structure of the Hilbert curve. It proposes three components—Hilbert-curve-based Fractal Serialization to preserve 2D neighborhoods with consistent statistics across resolutions, Fractal Hierarchy Skip Connection (FHSC) for deterministic state-injection routes derived from recursion levels, and Fractal-Aware 2D Rotary Position Encoding (FA-RoPE) that incorporates normalized 2D coordinates and fractal hierarchy levels—and reports empirical gains over prior Mamba vision backbones on ImageNet-1K classification, COCO detection/instance segmentation, ADE20K semantic segmentation, and LEVIR-CD+ change detection, with particular emphasis on high-resolution inputs.

Significance. If the neighborhood-consistency property of Hilbert serialization is quantitatively validated and the reported gains prove robust, the work would supply a deterministic, parameter-light geometric mechanism for scaling state-space vision models to arbitrary resolutions without retraining or custom kernels, addressing a recognized limitation in current Vision Mamba designs and offering a reproducible template for other long-sequence visual tasks.

major comments (2)

[Abstract] The central claim that Hilbert-curve serialization 'preserves local 2D neighborhoods more faithfully than linear scans and provides consistent neighborhood statistics across resolutions' is asserted as the geometric foundation for both serialization and position encoding, yet no locality metric (e.g., average 2D Euclidean distance of k-nearest serialized neighbors), ablation, or comparison across grid sizes is supplied; this assumption is load-bearing for the high-resolution gains claimed on COCO, ADE20K, and LEVIR-CD+.
[Experiments] The manuscript reports improvements across four benchmarks but supplies no quantitative details on ablation controls, error bars, or the precise protocol used for training-to-inference resolution scaling; without these, the attribution of gains specifically to the three new components cannot be rigorously assessed.

minor comments (1)

[Abstract] The acronyms FHSC and FA-RoPE are introduced without a one-sentence parenthetical gloss, which would aid readers unfamiliar with the method.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects that can strengthen the presentation of our geometric approach and experimental validation. We address each major comment point by point below and indicate the revisions planned for the next version.

Referee: [Abstract] The central claim that Hilbert-curve serialization 'preserves local 2D neighborhoods more faithfully than linear scans and provides consistent neighborhood statistics across resolutions' is asserted as the geometric foundation for both serialization and position encoding, yet no locality metric (e.g., average 2D Euclidean distance of k-nearest serialized neighbors), ablation, or comparison across grid sizes is supplied; this assumption is load-bearing for the high-resolution gains claimed on COCO, ADE20K, and LEVIR-CD+.

Authors: We agree that a direct quantitative locality metric would provide stronger, more explicit support for the geometric foundation. In the revised manuscript we will add a dedicated analysis (new subsection or appendix figure) that reports the average 2D Euclidean distance of the k-nearest serialized neighbors (for k=4,8) under Hilbert versus raster serialization, computed on grids of varying sizes (14×14, 28×28, 56×56). We will also include a short ablation that isolates the contribution of this locality property to high-resolution downstream performance. These additions will make the load-bearing assumption directly verifiable.

revision: yes

Referee: [Experiments] The manuscript reports improvements across four benchmarks but supplies no quantitative details on ablation controls, error bars, or the precise protocol used for training-to-inference resolution scaling; without these, the attribution of gains specifically to the three new components cannot be rigorously assessed.

Authors: We concur that additional experimental rigor is required. In the revision we will expand the Experiments section to include: (i) full ablation tables quantifying the incremental contribution of each component (Fractal Serialization, FHSC, FA-RoPE) on all four benchmarks; (ii) mean and standard deviation over at least three independent runs with different random seeds; and (iii) an explicit protocol subsection describing the training resolution (224²), the exact higher inference resolutions tested, and the deterministic interpolation/padding procedure used for resolution scaling without retraining. These details will allow readers to assess attribution of the reported gains.

revision: yes
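The locality analysis promised in the first response could be prototyped roughly as follows. This is a sketch under stated assumptions, not the authors' procedure: "k-nearest serialized neighbors" is read as the k/2 tokens on each side of a token in the 1D sequence, and because 14, 28, and 56 are not powers of two, each grid is embedded in the next power-of-two square and curve positions outside the grid are skipped. That cropping rule and all function names are hypothetical.

```python
import math


def hilbert_d2xy(order, d):
    """Standard Hilbert index -> (x, y) conversion on a 2**order square grid."""
    x = y = 0
    s = 1
    while s < (1 << order):
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y


def hilbert_sequence(h, w):
    """Cells of an h x w grid in Hilbert order (assumption: crop a larger square curve)."""
    order = (max(h, w) - 1).bit_length()
    side = 1 << order
    cells = (hilbert_d2xy(order, d) for d in range(side * side))
    return [(x, y) for x, y in cells if x < w and y < h]


def raster_sequence(h, w):
    return [(x, y) for y in range(h) for x in range(w)]


def knn_spread(seq, k):
    """Mean 2D Euclidean distance from each token to its k nearest 1D neighbors."""
    half = k // 2
    total, count = 0.0, 0
    for i, (x, y) in enumerate(seq):
        for j in range(max(0, i - half), min(len(seq), i + half + 1)):
            if j != i:
                u, v = seq[j]
                total += math.hypot(x - u, y - v)
                count += 1
    return total / count


for n in (14, 28, 56):
    for k in (4, 8):
        hil = knn_spread(hilbert_sequence(n, n), k)
        ras = knn_spread(raster_sequence(n, n), k)
        print(f"{n}x{n} k={k}: hilbert={hil:.3f} raster={ras:.3f}")
```

Comparing how each column changes from 14×14 to 56×56 is one way to check whether the neighborhood statistics stay consistent across resolutions; tokens near the sequence ends simply have fewer neighbors in this sketch.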

Circularity Check

0 steps flagged

No circularity: design choices are independent geometric constructions with external experimental validation

The paper's core claims rest on three explicitly introduced components (Fractal Serialization, FHSC, FA-RoPE) whose definitions are derived directly from the known recursive properties of the Hilbert curve rather than from any fitted parameter or self-referential equation. The abstract states the neighborhood-consistency property as a geometric fact about Hilbert curves and then reports downstream empirical gains on ImageNet, COCO, ADE20K and LEVIR-CD+; none of these gains are shown to be algebraically forced by the same quantities used to define the components. No self-citations, uniqueness theorems, or ansatzes from prior author work appear in the provided text, and no prediction is obtained by fitting a subset of the target metric. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the untested premise that Hilbert recursion supplies consistent 2D neighborhood statistics at every scale and that the derived skip routes and position encoding inherit that property without additional fitting.

axioms (1)

domain assumption: Hilbert-curve-based serialization preserves local 2D neighborhoods more faithfully than linear scans and provides consistent neighborhood statistics across resolutions.
Invoked in the abstract as the single geometric principle organizing serialization, skip connections, and position encoding.

invented entities (2)

Fractal Hierarchy Skip Connection (FHSC) · no independent evidence
purpose: Derives deterministic state-injection routes from Hilbert recursion levels to mitigate long-sequence information fading
Newly introduced component whose routes are claimed to be free of runtime search or hand-derived gradients.

Fractal-Aware 2D Rotary Position Encoding (FA-RoPE) · no independent evidence
purpose: Combines normalized 2D coordinates with fractal hierarchy level so feature interactions depend on spatial proximity and recursive structural role
Newly introduced encoding whose dependence on fractal level is presented as the key to resolution consistency.

pith-pipeline@v0.9.0 · 5802 in / 1490 out tokens · 35563 ms · 2026-05-22T13:54:44.433694+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

"the recursive self-similar structure of the Hilbert curve determines how patches are serialized, where long-range state shortcuts are inserted, and how positional relations are encoded... provides consistent neighborhood statistics across resolutions"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

"FHSC selects one representative pair of spatially adjacent but sequentially distant sibling segments at each recursion level... $E = \bigcup_{l=1}^{L} \{(\mathrm{mid}(S^{(1)}_l), \mathrm{mid}(S^{(4)}_l))\}$"
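One possible reading of the edge-set formula in the passage above can be written as index arithmetic. The excerpt does not say which parent block supplies the representative pair or how `mid` is rounded, so the sketch below makes three assumptions: the sequence at level l is cut into equal sibling segments of length N / 4^l, the pair is taken from the leading block (first and fourth siblings), and `mid` is the integer midpoint. The function name `fhsc_edges` is hypothetical, and the paper's actual construction may differ.

```python
def fhsc_edges(seq_len: int, levels: int) -> list[tuple[int, int]]:
    """One (source, target) sequence-index pair per recursion level (assumed reading)."""
    edges = []
    for level in range(1, levels + 1):
        seg = seq_len // 4 ** level        # assumed length of one sibling segment
        if seg == 0:
            break                          # no finer level exists for this sequence
        first_mid = seg // 2               # midpoint of the first sibling, which starts at 0
        fourth_mid = 3 * seg + seg // 2    # midpoint of the fourth sibling, which starts at 3 * seg
        edges.append((first_mid, fourth_mid))
    return edges


# Hypothetical 8x8 patch grid: 64 tokens, three recursion levels.
print(fhsc_edges(seq_len=64, levels=3))
```

Under this reading the number of shortcut routes grows with the number of recursion levels rather than with sequence length, which is consistent with the "compact set of deterministic state-injection routes" described in the abstract.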

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 9 internal anchors

[1] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in NAACL, 2019, pp. 4171–4186.
[2] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in ICML, vol. 139. PMLR, 2021, pp. 8748–8763.
[3] J. Li, D. Li, C. Xiong, and S. C. H. Hoi, “BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation,” in ICML, ser. Proceedings of Machine Learning Research, vol
[4] PMLR, 2022, pp. 12 888–12 900
[5] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. De... “Palm: Scaling language modeling with pathways,” 2023.
[6] OpenAI, “GPT-4 technical report,” CoRR, vol. abs/2303.08774, 2023.
[7] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “Llama: Open and efficient foundation language models,” CoRR, vol. abs/2302.13971, 2023.
[8] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P. Huang, S. Li, I. Misra, M. G. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jégou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “Dinov2: Learning robust visual features without supervision,” ... 2023.
[9] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick, “Segment anything,” in ICCV, October 2023, pp. 4015–4026.
[10] J. Li, D. Li, S. Savarese, and S. C. H. Hoi, “BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models,” in ICML, vol. 202, 2023, pp. 19 730–19 742.
[11] T. Zheng, P. Jiang, B. Wan, H. Zhang, J. Chen, J. Wang, and B. Li, “Beta-tuned timestep diffusion model,” in ECCV (3), ser. Lecture Notes in Computer Science, vol. 15061. Springer, 2024, pp. 114–130.
[12] N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V. Alwala, N. Carion, C.-Y. Wu, R. Girshick, P. Dollár, and C. Feichtenhofer, “Sam 2: Segment anything in images and videos,” arXiv preprint arXiv:2408.00714, 2024. [Online]. Available: https://arxiv.org/abs/2408.00714
[13] Z. Chen, W. Wang, H. Tian, S. Ye, Z. Gao, E. Cui, W. Tong, K. Hu, J. Luo, Z. Ma et al., “How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites,” arXiv preprint arXiv:2404.16821, 2024.
[14] M. Awais, M. Naseer, S. Khan, R. M. Anwer, H. Cholakkal, M. Shah, M.-H. Yang, and F. S. Khan, “Foundation models defining a new era in vision: a survey and outlook,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
[15] J. Liu, C. Yang, Z. Lu, J. Chen, Y. Li, M. Zhang, T. Bai, Y. Fang, L. Sun, P. S. Yu et al., “Graph foundation models: Concepts, opportunities and challenges,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
[16] Y. Li, S. Jiang, B. Hu, L. Wang, W. Zhong, W. Luo, L. Ma, and M. Zhang, “Uni-moe: Scaling unified multimodal llms with mixture of experts,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 47, no. 5, pp. 3424–3439, 2025.
[17] X. Lai, Z. Tian, Y. Chen, Y. Li, Y. Yuan, S. Liu, and J. Jia, “LISA: reasoning segmentation via large language model,” in CVPR. IEEE, 2024, pp. 9579–9589.
[18] L. Tang, P. Jiang, H. Xiao, and B. Li, “Towards training-free open-world segmentation via image prompt foundation models,” Int. J. Comput. Vis., vol. 133, no. 1, pp. 1–15, 2025.
[19] X. Zhuang, Y. Xie, Y. Deng, D. Yang, L. Liang, J. Ru, Y. Yin, and Y. Zou, “Vargpt-v1.1: Improve visual autoregressive large unified model via iterative instruction tuning and reinforcement learning,” arXiv preprint arXiv:2504.02949, 2025.
[20] J. Pan, C. Liu, J. Wu, F. Liu, J. Zhu, H. B. Li, C. Chen, C. Ouyang, and D. Rueckert, “Medvlm-r1: Incentivizing medical reasoning capability of vision-language models (vlms) via reinforcement learning,” CoRR, vol. abs/2502.19634, 2025.
[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NeurIPS, 2017, pp. 5998–6008.
[22] T. Dao and A. Gu, “Transformers are ssms: Generalized models and efficient algorithms through structured state space duality,” in ICML. OpenReview.net, 2024.
[23] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” CoRR, vol. abs/2312.00752, 2023.
[24] T. Huang, X. Pei, S. You, F. Wang, C. Qian, and C. Xu, “Localmamba: Visual state space model with windowed selective scan,” CoRR, vol. abs/2403.09338, 2024.
[25] C. Yang, Z. Chen, M. Espinosa, L. Ericsson, Z. Wang, J. Liu, and E. J. Crowley, “Plainmamba: Improving non-hierarchical mamba in visual recognition,” CoRR, vol. abs/2403.17695, 2024.
[26] L. Zhu, B. Liao, Q. Zhang, X. Wang, W. Liu, and X. Wang, “Vision mamba: Efficient visual representation learning with bidirectional state space model,” in ICML. OpenReview.net, 2024.
[27] Y. Liu, Y. Tian, Y. Zhao, H. Yu, L. Xie, Y. Wang, Q. Ye, J. Jiao, and Y. Liu, “Vmamba: Visual state space model,” in NeurIPS, 2024.
[28] Y. Xiao, L. Song, S. Huang, J. Wang, S. Song, Y. Ge, X. Li, and Y. Shan, “Grootvl: Tree topology is all you need in state space model,” CoRR, vol. abs/2406.02395, 2024.
[29] R. Tian, Z. Wu, Q. Dai, H. Hu, Y. Qiao, and Y. Jiang, “Resformer: Scaling vits with multi-resolution training,” in CVPR. IEEE, 2023, pp. 22 721–22 731.
[30] D. Han, Z. Wang, Z. Xia, Y. Han, Y. Pu, C. Ge, J. Song, S. Song, B. Zheng, and G. Huang, “Demystify mamba in vision: A linear attention perspective,” in NeurIPS, 2024.
[31] Y. Shi, M. Dong, and C. Xu, “Multi-scale vmamba: Hierarchy in hierarchy visual state space model,” in NeurIPS, 2024.
[32] C. Xiao, M. Li, Z. Zhang, D. Meng, and L. Zhang, “Spatial-mamba: Effective visual state space models via structure-aware state fusion,” CoRR, vol. abs/2410.15091, 2024.
[33] A. Hatamizadeh and J. Kautz, “Mambavision: A hybrid mamba-transformer vision backbone,” CoRR, vol. abs/2407.08083, 2024.
[34] F. Wang, J. Wang, S. Ren, G. Wei, J. Mei, W. Shao, Y. Zhou, A. L. Yuille, and C. Xie, “Mamba-r: Vision mamba ALSO needs registers,” CoRR, vol. abs/2405.14858, 2024.
[35] X. Pei, T. Huang, and C. Xu, “Efficientvmamba: Atrous selective scan for light weight visual mamba,” in AAAI. AAAI Press, 2025, pp. 6443–6451.
[36] H. Xiao, L. Tang, P.-t. Jiang, H. Zhang, J. Chen, and B. Li, “Boosting vision state space model with fractal scanning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 8, 2025, pp. 8646–8654.
[37] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012, pp. 1106–1114.
[38] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
[39] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR. IEEE, 2016, pp. 770–778.
[40] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” CoRR, vol. abs/1704.04861, 2017.
[41] I. Radosavovic, R. P. Kosaraju, R. B. Girshick, K. He, and P. Dollár, “Designing network design spaces,” in CVPR. IEEE, 2020, pp. 10 425–10 433.
[42] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in ICLR. OpenReview.net, 2021.
[43] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in ICCV. IEEE, 2021, pp. 9992–10 002.
[44] H. Touvron, M. Cord, and H. Jégou, “Deit III: revenge of the vit,” in ECCV, vol. 13684. Springer, 2022, pp. 516–533.
[45] A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with structured state spaces,” in ICLR. OpenReview.net, 2022.
[46] H. Li, J. Yang, K. Wang, X. Qiu, Y. Chou, X. Li, and G. Li, “Scalable autoregressive image generation with mamba,” CoRR, vol. abs/2408.12245, 2024.
[47] H. Zhao, M. Zhang, W. Zhao, P. Ding, S. Huang, and D. Wang, “Cobra: Extending mamba to multi-modal large language model for efficient inference,” in AAAI. AAAI Press, 2025, pp. 10 421–10 429.
[48] J. Su, M. H. M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “Roformer: Enhanced transformer with rotary position embedding,” Neurocomputing, vol. 568, p. 127063, 2024.
[49] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR. IEEE, 2009, pp. 248–255.
[50]

Decoupled Weight Decay Regularization

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[51]

Microsoft coco: Common objects in context,

T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll ´ar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” inECCV. Springer, 2014, pp. 740–755

work page 2014
[52]

Mask R-CNN,

K. He, G. Gkioxari, P. Doll ´ar, and R. B. Girshick, “Mask R-CNN,” in ICCV. IEEE Computer Society, 2017, pp. 2980–2988

work page 2017
[53]

MMDetection: Open MMLab Detection Toolbox and Benchmark

K. Chen, J. Wang, J. Pang, Y . Cao, Y . Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xuet al., “Mmdetection: Open mmlab detection toolbox and benchmark,”arXiv preprint arXiv:1906.07155, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1906
[54]

Defmamba: Deformable visual state space model,

L. Liu, M. Zhang, J. Yin, T. Liu, W. Ji, Y . Piao, and H. Lu, “Defmamba: Deformable visual state space model,”arXiv preprint arXiv:2504.05794, 2025

[55] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba, “Semantic understanding of scenes through the ade20k dataset,” International Journal of Computer Vision, vol. 127, no. 3, pp. 302–321, 2019.

[56] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, “Unified perceptual parsing for scene understanding,” in ECCV, 2018, pp. 418–434.

[57] H. Chen, J. Song, C. Han, J. Xia, and N. Yokoya, “Changemamba: Remote sensing change detection with spatio-temporal state space model,” arXiv preprint arXiv:2404.03425, 2024.

[58] C. Xie, C. Xia, M. Ma, Z. Zhao, X. Chen, and J. Li, “Pyramid grafting network for one-stage high resolution saliency detection,” in CVPR. IEEE, 2022, pp. 11 707–11 716.

