FractalMamba++: Scaling Vision Mamba Across Resolutions via Hilbert Fractal Geometry
Pith reviewed 2026-05-22 13:54 UTC · model grok-4.3
The pith
Hilbert fractal curves let Vision Mamba models keep spatial continuity when input resolution changes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a single geometric principle, the recursive structure of the Hilbert curve, determines patch serialization, yields deterministic state-injection routes, and augments position encoding. Feature interactions then reflect actual spatial proximity rather than 1D order, which lets Vision Mamba scale across resolutions while preserving local neighborhoods.
What carries the argument
The Hilbert curve, a space-filling path whose recursive subdivisions tend to keep nearby 2D patches close in the 1D sequence, is applied here to build fractal serialization, hierarchy skip connections, and fractal-aware rotary position encoding.
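As a rough illustration of the serialization idea (a sketch, not the paper's implementation), the widely used index-to-coordinate conversion for a Hilbert curve can order the patches of a power-of-two grid. The names `hilbert_d2xy` and `hilbert_order` and the 4×4 grid are illustrative assumptions; this page does not say how the authors handle grids whose side is not a power of two.

```python
from typing import List, Tuple


def hilbert_d2xy(n: int, d: int) -> Tuple[int, int]:
    """Map sequence index d to (x, y) on an n x n Hilbert curve; n must be a power of two."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate or flip the sub-square built so far
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_order(n: int) -> List[Tuple[int, int]]:
    """Patch coordinates in the order the curve visits them."""
    return [hilbert_d2xy(n, d) for d in range(n * n)]


order = hilbert_order(4)
# Manhattan distance between each pair of consecutive patches in the sequence.
steps = [abs(x1 - x0) + abs(y1 - y0)
         for (x0, y0), (x1, y1) in zip(order, order[1:])]
print(order)
print(set(steps))
```

On a power-of-two grid every consecutive pair of tokens is a pair of edge-adjacent patches, which a raster scan cannot guarantee at row ends. The converse does not hold: some edge-adjacent patches still land far apart in the sequence, which is the gap the skip connections and position encoding are meant to address.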
If this is right
- Performance improves over prior Mamba vision models on ImageNet-1K classification, with larger gains at high resolutions.
- Detection and instance segmentation accuracy rises on COCO when inputs exceed training resolution.
- Semantic segmentation on ADE20K and change detection on LEVIR-CD+ benefit similarly from the resolution-consistent ordering.
- The skip connections and position encoding require no learned search or specialized kernels because they follow directly from the curve's recursion levels.
Where Pith is reading between the lines
- The same curve-based ordering could be tested in other linear-time sequence models to see whether locality preservation helps beyond Mamba.
- Extending the recursion to three dimensions might allow similar scaling for video or volumetric data without retraining per resolution.
- If neighborhood consistency is the key mechanism, replacing the Hilbert curve with other locality-preserving space-filling curves could be compared on the same tasks.
Load-bearing premise
Hilbert-curve serialization maintains consistent neighborhood statistics when the image grid size changes.
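One structural form of this consistency can be checked directly. The sketch below, under the assumption of power-of-two grids and the common `d2xy` index convention, tests whether the curve on a 2n×2n grid, viewed in 2×2 blocks, visits those blocks in the same order as the curve on the n×n grid. The helper `coarse_path_is_preserved` is a hypothetical name. This is not the neighborhood statistic the paper refers to, only a related self-similarity property.

```python
from typing import Dict, Tuple


def hilbert_d2xy(n: int, d: int) -> Tuple[int, int]:
    """Map sequence index d to (x, y) on an n x n Hilbert curve; n must be a power of two."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def coarse_path_is_preserved(n: int) -> bool:
    """True if token d of the 2n x 2n curve lies in the 2x2 block that
    token d // 4 of the n x n curve occupies, for every d."""
    for d in range(4 * n * n):
        x, y = hilbert_d2xy(2 * n, d)
        if (x // 2, y // 2) != hilbert_d2xy(n, d // 4):
            return False
    return True


# Check several doublings: 2->4, 4->8, 8->16, 16->32.
results: Dict[int, bool] = {n: coarse_path_is_preserved(n) for n in (2, 4, 8, 16)}
print(results)
```

With this convention, doubling the resolution refines each token into four consecutive tokens without reordering the coarse path. Other index conventions may satisfy the same property only up to a transpose, and the check says nothing about grids that are not powers of two.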
What would settle it
A direct measurement of the average sequence distance between spatially adjacent patches after serialization at several resolutions, followed by an ablation showing that the claimed performance gains disappear when those distances vary sharply.
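A minimal sketch of the first half of that test, assuming power-of-two grids and 4-connected adjacency, is shown below. It records, for every pair of edge-adjacent patches, how far apart the two patches sit in the serialized sequence, for both a Hilbert and a raster order. The function names and the chosen grid sizes (8, 16, 32) are illustrative assumptions, and no model training or ablation is involved.

```python
from typing import Dict, List, Tuple


def hilbert_d2xy(n: int, d: int) -> Tuple[int, int]:
    """Map sequence index d to (x, y) on an n x n Hilbert curve; n must be a power of two."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def adjacent_gaps(pos: Dict[Tuple[int, int], int], n: int) -> List[int]:
    """Sequence gap |pos(a) - pos(b)| for every pair of edge-adjacent patches."""
    gaps = []
    for x in range(n):
        for y in range(n):
            if x + 1 < n:
                gaps.append(abs(pos[(x, y)] - pos[(x + 1, y)]))
            if y + 1 < n:
                gaps.append(abs(pos[(x, y)] - pos[(x, y + 1)]))
    return gaps


def summarize(n: int) -> Dict[str, Dict[str, float]]:
    """Mean gap and share of gaps equal to 1, for Hilbert and raster orders."""
    orders = {
        "hilbert": {hilbert_d2xy(n, d): d for d in range(n * n)},
        "raster": {(x, y): y * n + x for y in range(n) for x in range(n)},
    }
    out = {}
    for name, pos in orders.items():
        gaps = adjacent_gaps(pos, n)
        out[name] = {
            "mean_gap": sum(gaps) / len(gaps),
            "unit_fraction": sum(g == 1 for g in gaps) / len(gaps),
        }
    return out


report = {n: summarize(n) for n in (8, 16, 32)}
for n, row in report.items():
    print(n, row)
```

The mean is sensitive to a few very large gaps, so a fuller analysis would probably also report medians or percentiles, and would need to be paired with the ablation described above.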
Original abstract
Vision Mamba offers linear complexity for long visual sequences, yet its performance depends critically on how a two-dimensional patch grid is serialized into a one-dimensional state-space recurrence. Raster-style scans disrupt spatial continuity, and the mismatch between 2D locality and 1D state propagation becomes increasingly severe when the inference resolution grows beyond the training grid. This paper presents FractalMamba++, a resolution-scalable vision backbone organized around a single geometric principle: the recursive self-similar structure of the Hilbert curve determines how patches are serialized, where long-range state shortcuts are inserted, and how positional relations are encoded. First, Hilbert-curve-based Fractal Serialization preserves local 2D neighborhoods more faithfully than linear scans and provides consistent neighborhood statistics across resolutions. Second, the Fractal Hierarchy Skip Connection (FHSC) derives a compact set of deterministic state-injection routes from Hilbert recursion levels, mitigating long-sequence information fading without runtime search, hand-derived gradients, or dedicated CUDA kernels. Third, Fractal-Aware 2D Rotary Position Encoding (FA-RoPE) combines normalized 2D coordinates with a fractal hierarchy level so that feature interactions depend on actual spatial proximity and recursive structural role rather than serialized 1D distance. Extensive experiments on ImageNet-1K classification, COCO detection and instance segmentation, ADE20K semantic segmentation, and LEVIR-CD+ remote sensing change detection show that FractalMamba++ improves over existing Mamba-based vision backbones, especially under high-resolution inputs.
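The abstract does not give the FA-RoPE formula, so the following is only a sketch of the generic ingredient it builds on: an axial 2D rotary encoding driven by normalized patch coordinates. The frequency schedule, the `scale` parameter, the split of channels between x and y, and all function names are assumptions, and the fractal hierarchy level is deliberately left out because this page does not define it.

```python
import math
from typing import List, Sequence, Tuple


def patch_center(i: int, n: int) -> float:
    """Normalized center of patch i along one axis of an n-patch grid, in (0, 1)."""
    return (i + 0.5) / n


def rope_2d(vec: Sequence[float], u: float, v: float, scale: float = 8.0) -> List[float]:
    """Rotate channel pairs: the first half by the x coordinate u, the second half by y coordinate v."""
    n_pairs = len(vec) // 2
    half = n_pairs // 2
    out: List[float] = []
    for i in range(n_pairs):
        coord = u if i < half else v
        j = i if i < half else i - half
        theta = coord * scale / (100.0 ** (j / max(half, 1)))  # assumed schedule
        a, b = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out += [a * c - b * s, a * s + b * c]
    return out


def score(q: Sequence[float], k: Sequence[float],
          pq: Tuple[float, float], pk: Tuple[float, float]) -> float:
    """Dot product of the rotated query and key."""
    rq, rk = rope_2d(q, *pq), rope_2d(k, *pk)
    return sum(a * b for a, b in zip(rq, rk))


q = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, -0.6]
k = [1.0, 0.2, -0.8, 0.4, 0.6, -0.3, 0.5, 0.7]

# Same 2D offset between query and key, at two different absolute locations.
s1 = score(q, k, (0.10, 0.20), (0.30, 0.50))
s2 = score(q, k, (0.40, 0.30), (0.60, 0.60))
print(s1, s2)

# The same image location has the same normalized coordinate at 14 and 28 patches per side.
coarse = patch_center(3, 14)
fine = (patch_center(6, 28) + patch_center(7, 28)) / 2
print(coarse, fine)
```

Because the rotation angle is linear in the coordinate, the score depends only on the 2D offset between the two patches, not on their positions in the serialized sequence. Normalizing coordinates keeps that offset comparable when the grid is refined.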
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FractalMamba++, a resolution-scalable vision backbone for Mamba-based models that organizes serialization, state shortcuts, and positional encoding around the recursive self-similar structure of the Hilbert curve. It proposes three components—Hilbert-curve-based Fractal Serialization to preserve 2D neighborhoods with consistent statistics across resolutions, Fractal Hierarchy Skip Connection (FHSC) for deterministic state-injection routes derived from recursion levels, and Fractal-Aware 2D Rotary Position Encoding (FA-RoPE) that incorporates normalized 2D coordinates and fractal hierarchy levels—and reports empirical gains over prior Mamba vision backbones on ImageNet-1K classification, COCO detection/instance segmentation, ADE20K semantic segmentation, and LEVIR-CD+ change detection, with particular emphasis on high-resolution inputs.
Significance. If the neighborhood-consistency property of Hilbert serialization is quantitatively validated and the reported gains prove robust, the work would supply a deterministic, parameter-light geometric mechanism for scaling state-space vision models to arbitrary resolutions without retraining or custom kernels, addressing a recognized limitation in current Vision Mamba designs and offering a reproducible template for other long-sequence visual tasks.
major comments (2)
- [Abstract] The central claim that Hilbert-curve serialization 'preserves local 2D neighborhoods more faithfully than linear scans and provides consistent neighborhood statistics across resolutions' is asserted as the geometric foundation for both serialization and position encoding, yet no locality metric (e.g., average 2D Euclidean distance of k-nearest serialized neighbors), ablation, or comparison across grid sizes is supplied; this assumption is load-bearing for the high-resolution gains claimed on COCO, ADE20K, and LEVIR-CD+.
- [Experiments] The manuscript reports improvements across four benchmarks but supplies no quantitative details on ablation controls, error bars, or the precise protocol used for training-to-inference resolution scaling; without these, the attribution of gains specifically to the three new components cannot be rigorously assessed.
minor comments (1)
- [Abstract] The acronyms FHSC and FA-RoPE are introduced without a one-sentence parenthetical gloss, which would aid readers unfamiliar with the method.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects that can strengthen the presentation of our geometric approach and experimental validation. We address each major comment point by point below and indicate the revisions planned for the next version.
- Referee: [Abstract] The central claim that Hilbert-curve serialization 'preserves local 2D neighborhoods more faithfully than linear scans and provides consistent neighborhood statistics across resolutions' is asserted as the geometric foundation for both serialization and position encoding, yet no locality metric (e.g., average 2D Euclidean distance of k-nearest serialized neighbors), ablation, or comparison across grid sizes is supplied; this assumption is load-bearing for the high-resolution gains claimed on COCO, ADE20K, and LEVIR-CD+.
  Authors: We agree that a direct quantitative locality metric would provide stronger, more explicit support for the geometric foundation. In the revised manuscript we will add a dedicated analysis (new subsection or appendix figure) that reports the average 2D Euclidean distance of the k-nearest serialized neighbors (for k=4,8) under Hilbert versus raster serialization, computed on grids of varying sizes (14×14, 28×28, 56×56). We will also include a short ablation that isolates the contribution of this locality property to high-resolution downstream performance. These additions will make the load-bearing assumption directly verifiable.
  Revision: yes
- Referee: [Experiments] The manuscript reports improvements across four benchmarks but supplies no quantitative details on ablation controls, error bars, or the precise protocol used for training-to-inference resolution scaling; without these, the attribution of gains specifically to the three new components cannot be rigorously assessed.
  Authors: We concur that additional experimental rigor is required. In the revision we will expand the Experiments section to include: (i) full ablation tables quantifying the incremental contribution of each component (Fractal Serialization, FHSC, FA-RoPE) on all four benchmarks; (ii) mean and standard deviation over at least three independent runs with different random seeds; and (iii) an explicit protocol subsection describing the training resolution (224²), the exact higher inference resolutions tested, and the deterministic interpolation/padding procedure used for resolution scaling without retraining. These details will allow readers to assess attribution of the reported gains.
  Revision: yes
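The locality analysis promised in the first response could be prototyped along the following lines. This is a sketch under stated assumptions: the 14, 28, and 56 grids are not powers of two, so the Hilbert order is obtained here by cropping the curve of the enclosing power-of-two grid, which may differ from what the authors do; the k nearest serialized neighbors are taken as a symmetric window of k/2 tokens on each side, truncated at the sequence ends; all function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[int, int]


def hilbert_d2xy(n: int, d: int) -> Point:
    """Map sequence index d to (x, y) on an n x n Hilbert curve; n must be a power of two."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_sequence(n: int) -> List[Point]:
    """Hilbert order on an n x n grid, cropped from the enclosing power-of-two curve (assumption)."""
    m = 1
    while m < n:
        m *= 2
    points = (hilbert_d2xy(m, d) for d in range(m * m))
    return [(x, y) for x, y in points if x < n and y < n]


def raster_sequence(n: int) -> List[Point]:
    """Row-major order on an n x n grid."""
    return [(x, y) for y in range(n) for x in range(n)]


def knn_mean_distance(seq: List[Point], k: int) -> float:
    """Mean 2D Euclidean distance between each token and its serialized neighbors
    within k // 2 positions on either side."""
    half = k // 2
    total, count = 0.0, 0
    for i, (x, y) in enumerate(seq):
        for j in range(max(0, i - half), min(len(seq), i + half + 1)):
            if j != i:
                total += math.hypot(x - seq[j][0], y - seq[j][1])
                count += 1
    return total / count


table: Dict[Tuple[int, int], Tuple[float, float]] = {}
for n in (14, 28, 56):
    for k in (4, 8):
        table[(n, k)] = (knn_mean_distance(hilbert_sequence(n), k),
                         knn_mean_distance(raster_sequence(n), k))
for key, (hilbert_value, raster_value) in table.items():
    print(key, round(hilbert_value, 3), round(raster_value, 3))
```

Cropping introduces jumps wherever the curve leaves and re-enters the kept region, so the numbers for 14, 28, and 56 depend on that choice and should be read as a property of this sketch rather than of the paper's serialization.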
Circularity Check
No circularity: design choices are independent geometric constructions with external experimental validation
The paper's core claims rest on three explicitly introduced components (Fractal Serialization, FHSC, FA-RoPE) whose definitions are derived directly from the known recursive properties of the Hilbert curve rather than from any fitted parameter or self-referential equation. The abstract states the neighborhood-consistency property as a geometric fact about Hilbert curves and then reports downstream empirical gains on ImageNet, COCO, ADE20K and LEVIR-CD+; none of these gains are shown to be algebraically forced by the same quantities used to define the components. No self-citations, uniqueness theorems, or ansatzes from prior author work appear in the provided text, and no prediction is obtained by fitting a subset of the target metric. The derivation chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Hilbert-curve-based serialization preserves local 2D neighborhoods more faithfully than linear scans and provides consistent neighborhood statistics across resolutions
invented entities (2)
- Fractal Hierarchy Skip Connection (FHSC): no independent evidence
- Fractal-Aware 2D Rotary Position Encoding (FA-RoPE): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking
  Tag: echoes. This paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "the recursive self-similar structure of the Hilbert curve determines how patches are serialized, where long-range state shortcuts are inserted, and how positional relations are encoded... provides consistent neighborhood statistics across resolutions"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, embed_strictMono_of_one_lt
  Tag: echoes. This paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "FHSC selects one representative pair of spatially adjacent but sequentially distant sibling segments at each recursion level... $E = \bigcup_{l=1}^{L} \{(\operatorname{mid}(S_l^{(1)}), \operatorname{mid}(S_l^{(4)}))\}$"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
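The FHSC passage quoted in the list above pairs the midpoints of the first and fourth sibling segments at each recursion level. A minimal sketch of one possible reading follows. It assumes a sequence of 4^L tokens on a power-of-two grid and, as a purely hypothetical choice, takes the four siblings at level l to be the first four segments of length 4^L / 4^l. The paper may select a different parent block, and the direction of state injection is not specified on this page.

```python
from typing import List, Tuple


def hilbert_d2xy(n: int, d: int) -> Tuple[int, int]:
    """Map sequence index d to (x, y) on an n x n Hilbert curve; n must be a power of two."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def fhsc_edges(levels: int) -> List[Tuple[int, int]]:
    """One (first-segment midpoint, fourth-segment midpoint) index pair per level l = 1..levels.
    The choice of the leading parent block at each level is an assumption."""
    n_tokens = 4 ** levels
    edges = []
    for l in range(1, levels + 1):
        seg = n_tokens // 4 ** l          # length of one sibling segment at level l
        first_mid = seg // 2              # midpoint of sibling 1
        fourth_mid = 3 * seg + seg // 2   # midpoint of sibling 4
        edges.append((first_mid, fourth_mid))
    return edges


levels = 3
side = 2 ** levels
edges = fhsc_edges(levels)

# Compare the gap in the sequence with the Manhattan distance on the grid.
checks = []
for a, b in edges:
    (xa, ya), (xb, yb) = hilbert_d2xy(side, a), hilbert_d2xy(side, b)
    checks.append((b - a, abs(xa - xb) + abs(ya - yb)))
print(edges)
print(checks)
```

The number of edges grows with the recursion depth L rather than with the token count, which is consistent with the page's description of a compact, deterministic set of routes that needs no runtime search.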