Pith · machine review for the scientific record

arXiv: 2401.09417 · v3 · submitted 2024-01-17 · 💻 cs.CV · cs.LG

Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

Bencheng Liao, Lianghui Zhu, Qian Zhang, Wenyu Liu, Xinggang Wang, Xinlong Wang

Pith reviewed 2026-05-11 21:29 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords vision backbone · state space model · Mamba · image classification · object detection · semantic segmentation · efficient vision model

The pith

A vision backbone built on bidirectional Mamba blocks outperforms DeiT transformers in accuracy and efficiency on image tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that self-attention is not essential for visual representation learning by replacing it with bidirectional state space models in a new backbone called Vim. Vim adds position embeddings to image patch sequences and uses bidirectional Mamba blocks to compress and represent the visual information efficiently. This yields higher performance than DeiT on ImageNet classification, COCO detection, and ADE20K segmentation, with much better speed and memory efficiency, especially at high resolutions. Sympathetic readers would care because it challenges the dominance of attention-based models and offers a path to more efficient vision systems that can handle larger images without excessive resource use.

Core claim

The central claim is that the reliance on self-attention for visual representation learning is not necessary. The authors propose Vim, a generic vision backbone with bidirectional Mamba blocks that marks image sequences with position embeddings and compresses the visual representation using bidirectional state space models. This achieves higher performance than DeiT on standard benchmarks while offering significant improvements in computation and memory efficiency.
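For readers who want the machinery in symbols: a state space layer maps a token sequence through a learned linear recurrence, discretized by zero-order hold. The rendering below follows standard S4/Mamba notation and is our refresher, not text from the paper; in the selective (Mamba) variant, Δ, B, and C are themselves functions of the input token.

```latex
% Discretized SSM recurrence (zero-order hold), standard S4/Mamba notation.
% In Mamba, \Delta, B, C are input-dependent ("selective").
h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t,
\quad\text{where}\quad
\bar{A} = \exp(\Delta A), \qquad
\bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B .
```

Because the recurrence carries a fixed-size state h_t, cost grows linearly in sequence length, which is the source of Vim's efficiency argument.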

What carries the argument

Bidirectional Mamba blocks, which process position-embedded image patch sequences with state space models to model visual data without attention layers.
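A minimal sketch of the shape of such a block follows, using a toy diagonal SSM as a stand-in for Mamba's hardware-aware selective scan. The merge-by-averaging, module names, and hyperparameters are our illustration, not the paper's implementation; Vim's actual block uses gated, convolution-augmented selective scans.

```python
# Toy bidirectional SSM block in the spirit of Vim. Everything here is a
# hedged illustration: the scan is a plain diagonal recurrence, not Mamba's
# selective scan, and averaging is one plausible way to merge directions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySSMScan(nn.Module):
    """Diagonal linear SSM: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    def __init__(self, dim: int):
        super().__init__()
        self.log_a = nn.Parameter(torch.full((dim,), -1.0))
        self.b = nn.Parameter(torch.ones(dim))
        self.c = nn.Parameter(torch.ones(dim))

    def forward(self, x):                       # x: (batch, seq_len, dim)
        a = torch.exp(-F.softplus(self.log_a))  # keep 0 < a < 1 for stability
        h = torch.zeros_like(x[:, 0])
        ys = []
        for t in range(x.shape[1]):             # O(L) scan; Mamba fuses this on GPU
            h = a * h + self.b * x[:, t]
            ys.append(self.c * h)
        return torch.stack(ys, dim=1)

class BidirectionalSSMBlock(nn.Module):
    """Scan the patch sequence in both directions, merge, add residual."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fwd = ToySSMScan(dim)
        self.bwd = ToySSMScan(dim)

    def forward(self, x):                       # x: position-embedded patch tokens
        z = self.norm(x)
        y_fwd = self.fwd(z)
        y_bwd = self.bwd(z.flip(1)).flip(1)     # reverse, scan, restore order
        return x + 0.5 * (y_fwd + y_bwd)        # residual + merged directions

# Usage: a 14x14 grid of patch tokens plus learned position embeddings.
tokens = torch.randn(2, 196, 192)
pos = torch.zeros(1, 196, 192, requires_grad=True)
out = BidirectionalSSMBlock(192)(tokens + pos)
print(out.shape)                                # torch.Size([2, 196, 192])
```

The flip-scan-flip pattern is what gives the block access to context on both sides of every patch, which a single causal scan lacks.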

If this is right

  • Vim achieves higher performance than DeiT on ImageNet classification.
  • Vim shows better results on COCO object detection and ADE20K semantic segmentation.
  • Vim is 2.8× faster than DeiT and saves 86.8% of GPU memory for batch inference on 1248×1248 images (a back-of-envelope cost comparison follows this list).
  • The approach lifts the computation and memory constraints on Transformer-style understanding of high-resolution images.
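A back-of-envelope comparison (our arithmetic, assuming the 16×16 patching standard for DeiT-style backbones) makes the resolution scaling concrete:

```python
# Illustrative token-count arithmetic, not a measurement from the paper.
# Self-attention stores ~N^2 pairwise scores; an SSM carries a fixed-size
# state per channel, so its memory grows ~N.
patch = 16
for res in (224, 1248):
    n = (res // patch) ** 2  # number of patch tokens
    print(f"{res}x{res}: {n} tokens, ~{n * n:,} attention entries vs ~{n:,} SSM steps")
# 224x224:   196 tokens -> ~38,416 attention entries
# 1248x1248: 6084 tokens -> ~37,015,056 attention entries
```

At 1248×1248 the quadratic term is roughly a thousand times larger than at 224×224, which is consistent in direction with the reported savings, though the exact 2.8× and 86.8% figures depend on implementation details.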

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If successful, similar bidirectional state space models could be applied to video processing where sequence lengths are even longer.
  • This might encourage development of hybrid architectures that combine Mamba blocks with other efficient components for vision.
  • The efficiency gains could allow training of larger vision models on limited hardware resources.

Load-bearing premise

Bidirectional state space models equipped with position embeddings can fully capture the position-sensitive aspects and global context needs of visual data without relying on self-attention.

What would settle it

A controlled experiment in which Vim, trained under the same conditions as DeiT, achieves lower accuracy on ImageNet or loses its efficiency advantage when extracting features from high-resolution images.

Original abstract

Recently the state space models (SSMs) with efficient hardware-aware designs, i.e., the Mamba deep learning model, have shown great potential for long sequence modeling. Meanwhile building efficient and generic vision backbones purely upon SSMs is an appealing direction. However, representing visual data is challenging for SSMs due to the position-sensitivity of visual data and the requirement of global context for visual understanding. In this paper, we show that the reliance on self-attention for visual representation learning is not necessary and propose a new generic vision backbone with bidirectional Mamba blocks (Vim), which marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models. On ImageNet classification, COCO object detection, and ADE20k semantic segmentation tasks, Vim achieves higher performance compared to well-established vision transformers like DeiT, while also demonstrating significantly improved computation & memory efficiency. For example, Vim is 2.8× faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on images with a resolution of 1248×1248. The results demonstrate that Vim is capable of overcoming the computation & memory constraints on performing Transformer-style understanding for high-resolution images and it has great potential to be the next-generation backbone for vision foundation models. Code is available at https://github.com/hustvl/Vim.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom & free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Vision Mamba (Vim), a vision backbone that replaces self-attention with bidirectional Mamba blocks (selective state space models) applied to flattened image patch sequences augmented by position embeddings. It claims this architecture achieves higher accuracy than DeiT on ImageNet classification, COCO detection, and ADE20K segmentation while offering substantial efficiency gains, such as 2.8× faster inference and 86.8% lower GPU memory usage on high-resolution images.

Significance. If the results hold under rigorous verification, the work provides evidence that SSM-based models can serve as efficient, generic alternatives to vision transformers, with particular promise for high-resolution tasks where attention's quadratic cost is prohibitive. The public code release is a clear strength that supports reproducibility and future extensions.

major comments (3)
  1. [Experiments] Experimental Setup and Results sections: Training protocols, optimizer settings, data augmentations, and number of runs are not fully specified for the ImageNet, COCO, and ADE20K benchmarks. This omission prevents independent verification of the reported gains over DeiT and weakens support for the central claim that self-attention is unnecessary.
  2. [Method] §3 (Method, bidirectional Mamba blocks): The architecture flattens 2D patches into a 1D sequence, applies forward and backward selective SSM scans, and relies on learned positional encodings to restore position sensitivity. No ablation isolates the contribution of bidirectionality versus position embeddings, nor tests whether this indirect mechanism suffices for 2D neighborhood and diagonal relations that self-attention models explicitly.
  3. [Experiments] Results tables (e.g., ImageNet and downstream tasks): Performance deltas are presented without statistical significance tests, standard deviations across seeds, or controls for training compute. This makes it difficult to assess whether the efficiency-accuracy trade-off truly demonstrates that attention can be dispensed with.
minor comments (2)
  1. [Abstract] Abstract: The efficiency numbers (2.8× speed, 86.8% memory) are given for batch inference on 1248×1248 images; clarify whether these measurements include the full forward pass or only feature extraction (an unambiguous measurement harness is sketched after this list).
  2. [Method] Figure 1 and architecture diagrams: The visualization of forward/backward scan paths on 2D patches would benefit from explicit annotation of how hidden states are merged across directions.
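To make the measurement ambiguity in minor comment 1 concrete, the sketch below times a full batched forward pass and records peak GPU memory. The function name, batch size, and defaults are our illustration, not the paper's protocol; it assumes a CUDA device.

```python
# Hedged benchmarking sketch: full forward pass, wall-clock time, peak memory.
import time
import torch

def profile_forward(model, resolution=1248, batch=4, device="cuda"):
    model = model.to(device).eval()
    x = torch.randn(batch, 3, resolution, resolution, device=device)
    torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        torch.cuda.synchronize(device)  # exclude queued kernels from timing
        t0 = time.perf_counter()
        model(x)                        # the full forward pass
        torch.cuda.synchronize(device)  # wait for all kernels to finish
        elapsed = time.perf_counter() - t0
    peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    return elapsed, peak_gb
```

Running the same harness over both backbones at matched batch sizes would pin down exactly what the 2.8× and 86.8% figures measure.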

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment point-by-point below and commit to revisions that improve reproducibility and analysis without altering the core claims.

Point-by-point responses
  1. Referee: [Experiments] Experimental Setup and Results sections: Training protocols, optimizer settings, data augmentations, and number of runs are not fully specified for the ImageNet, COCO, and ADE20K benchmarks. This omission prevents independent verification of the reported gains over DeiT and weakens support for the central claim that self-attention is unnecessary.

    Authors: We agree that fuller documentation aids verification. The training protocols follow standard practices for DeiT comparisons and are implemented in the released code, but we will expand the Experimental Setup section in the revision to explicitly list optimizer settings (AdamW, lr=5e-4, weight decay 0.05, cosine decay), data augmentations (RandAugment, Mixup, CutMix as in DeiT), and number of runs (mean and standard deviation over 3 seeds); a minimal sketch of this configuration follows this list. This directly addresses the concern for independent reproduction. revision: yes

  2. Referee: [Method] §3 (Method, bidirectional Mamba blocks): The architecture flattens 2D patches into a 1D sequence, applies forward and backward selective SSM scans, and relies on learned positional encodings to restore position sensitivity. No ablation isolates the contribution of bidirectionality versus position embeddings, nor tests whether this indirect mechanism suffices for 2D neighborhood and diagonal relations that self-attention models explicitly.

    Authors: We acknowledge the value of isolating these factors. The manuscript motivates bidirectionality for capturing global context in both directions and positional embeddings for spatial awareness, but does not include a dedicated ablation. In the revision we will add an ablation study comparing bidirectional vs. unidirectional Mamba blocks and the effect of removing positional embeddings. This will clarify the contribution to modeling 2D relations and strengthen the methodological justification. revision: yes

  3. Referee: [Experiments] Results tables (e.g., ImageNet and downstream tasks): Performance deltas are presented without statistical significance tests, standard deviations across seeds, or controls for training compute. This makes it difficult to assess whether the efficiency-accuracy trade-off truly demonstrates that attention can be dispensed with.

    Authors: We accept this criticism. The reported gains are consistent across three diverse tasks, but we did not include standard deviations or formal tests in the tables. In the revised manuscript we will add standard deviations from multiple seeds to the main results tables and include a brief discussion of statistical significance and training-compute controls (all models trained under matched protocols). This will better support the efficiency claims. revision: yes
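For concreteness, the settings enumerated in response 1 could be assembled as follows. This is a minimal sketch under the rebuttal's stated assumptions (AdamW at lr 5e-4, weight decay 0.05, cosine decay, DeiT-style augmentation); the placeholder model and the specific torchvision calls are our illustration, not the authors' released script.

```python
# Hedged training-config sketch mirroring the rebuttal's enumerated settings.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torchvision import transforms

model = torch.nn.Linear(192, 1000)   # placeholder for a Vim backbone + head
optimizer = AdamW(model.parameters(), lr=5e-4, weight_decay=0.05)
scheduler = CosineAnnealingLR(optimizer, T_max=300)  # cosine decay over 300 epochs

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandAugment(),        # RandAugment, as listed in the rebuttal
    transforms.ToTensor(),
])
# Mixup and CutMix act on batches rather than single images; the DeiT recipe
# typically uses mixup_alpha=0.8 and cutmix_alpha=1.0 (e.g., via timm.data.Mixup).
```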

Circularity Check

0 steps flagged

No significant circularity in empirical architecture proposal

full rationale

The paper proposes Vim as a vision backbone replacing self-attention with bidirectional Mamba blocks plus position embeddings, then validates via direct benchmark comparisons (ImageNet, COCO, ADE20K) against DeiT and other baselines. No equations, predictions, or first-principles results reduce to inputs by construction; the architecture is an explicit design choice whose sufficiency is tested externally and falsifiably. No self-citations are load-bearing for the central claim, and no fitted parameters are relabeled as predictions. This is a standard empirical contribution with independent content.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work is primarily empirical; it introduces no new physical entities or unproven mathematical axioms beyond standard deep-learning assumptions about sequence modeling.

free parameters (1)
  • model hyperparameters and architectural choices in Vim
    Standard tunable elements in neural network design that are fitted or selected to achieve reported performance.
axioms (1)
  • domain assumption: bidirectional state space models can represent visual data adequately when augmented with position embeddings
    Core premise invoked to justify replacing self-attention.

pith-pipeline@v0.9.0 · 5565 in / 1174 out tokens · 48424 ms · 2026-05-11T21:29:46.195528+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Selection, Not Fusion: Radar-Modulated State Space Models for Radar-Camera Depth Estimation

    cs.CV 2026-05 unverdicted novelty 7.0

    Radar-Modulated Selection perturbs only the step size Δ and readout C parameters inside Mamba's selective scan with radar data while keeping other components image-only, yielding state-of-the-art depth estimation on n...

  2. TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles

    cs.CV 2026-05 unverdicted novelty 7.0

    TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.

  3. Rethink MAE with Linear Time-Invariant Dynamics

    cs.CV 2026-04 unverdicted novelty 7.0

    Token order in frozen visual representations is exploitable via SSM-based LTI probes, revealing pre-training-dependent heterogeneity that fixed pooling misses.

  4. KAConvNet: Kolmogorov-Arnold Convolutional Networks for Vision Recognition

    cs.CV 2026-04 unverdicted novelty 7.0

    KAConvNet introduces a Kolmogorov-Arnold Convolutional Layer to build networks competitive with ViTs and CNNs while offering stronger theoretical interpretability.

  5. GraphLeap: Decoupling Graph Construction and Convolution for Vision GNN Acceleration on FPGA

    cs.CV 2026-04 conditional novelty 7.0

    GraphLeap decouples per-layer graph construction from feature updates in Vision GNNs by using previous-layer features for the current graph, enabling pipelined FPGA acceleration with up to 95.7× CPU speedup after fine-tuning.

  6. RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation

    cs.CV 2026-04 unverdicted novelty 7.0

    RSGMamba introduces a reliability-aware self-gated Mamba block for dynamic cross-modal feature selection in semantic segmentation, delivering state-of-the-art mIoU on RGB-D and RGB-T benchmarks with 48.6M parameters.

  7. WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects

    cs.CV 2026-04 unverdicted novelty 7.0

    WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.

  8. EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction

    cs.CV 2026-05 unverdicted novelty 6.0

    EmambaIR is a visual state space model with cross-modal top-k sparse attention and gated SSM components that outperforms prior CNN and ViT methods on event-guided deblurring, deraining, and HDR reconstruction while re...

  9. GEM: Generating LiDAR World Model via Deformable Mamba

    cs.CV 2026-05 unverdicted novelty 6.0

    GEM is a new LiDAR world model using deformable Mamba that disentangles dynamic and static features to generate high-fidelity simulations and achieve state-of-the-art results on autonomous driving benchmarks.

  10. BVI-Mamba: Video Enhancement Using a Visual State-Space Model for Low-Light and Underwater Environments

    cs.CV 2026-04 unverdicted novelty 6.0

    BVI-Mamba enhances low-light and underwater videos by combining feature alignment with a UNet architecture built from Visual State Space blocks, claiming better quality and efficiency than prior Transformer or convolu...

  11. MambaBack: Bridging Local Features and Global Contexts in Whole Slide Image Analysis

    cs.CV 2026-04 conditional novelty 6.0

    MambaBack is a hybrid Mamba-CNN model with Hilbert sampling and chunked inference that reports better performance than seven prior methods on five whole-slide image datasets.

  12. HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet

    cs.CV 2026-04 unverdicted novelty 6.0

    HAMSA achieves 85.7% ImageNet-1K top-1 accuracy as a spectral-domain SSM with 2.2x faster inference and lower memory than transformers or scanning-based SSMs.

  13. Event-Adaptive State Transition and Gated Fusion for RGB-Event Object Tracking

    cs.CV 2026-04 unverdicted novelty 6.0

    MambaTrack improves RGB-Event object tracking via event-adaptive state transitions in a Dynamic State Space Model and a Gated Projection Fusion module, reporting state-of-the-art results on FE108 and FELT datasets.

  14. PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

    cs.CV 2026-04 unverdicted novelty 6.0

    PoM is a new linear-complexity token mixer using learned polynomials that matches attention performance in transformers while enabling efficient long-sequence processing.

  15. YOLOv12: Attention-Centric Real-Time Object Detectors

    cs.CV 2025-02 unverdicted novelty 6.0

    YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.

  16. ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs

    cs.CV 2026-05 unverdicted novelty 5.0

    ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.

  17. SAMIC: A Lightweight Semantic-Aware Mamba for Efficient Perceptual Image Compression

    cs.CV 2026-05 unverdicted novelty 5.0

    SAMIC introduces semantic-aware Mamba blocks and SVD-based redundancy reduction to achieve efficient perceptual image compression with improved rate-distortion-perception tradeoffs.

  18. TopoMamba: Topology-Aware Scanning and Fusion for Segmenting Heterogeneous Medical Visual Media

    cs.CV 2026-04 unverdicted novelty 5.0

    TopoMamba improves medical image segmentation by combining topology-aware diagonal scans with standard cross-scans and a HSIC Gate for efficient fusion, yielding gains on thin and curved targets like the pancreas.

  19. Breaking the Resource Wall: Geometry-Guided Sequence Modeling for Efficient Semantic Segmentation

    cs.CV 2026-04 unverdicted novelty 5.0

    DGM-Net reaches 82.3% mIoU on Cityscapes and 45.24% on ADE20K using directional geometric guidance inside a linear-complexity Mamba backbone, without heavy pretraining or large models.

  20. Uni-Encoder Meets Multi-Encoders: Representation Before Fusion for Brain Tumor Segmentation with Missing Modalities

    cs.CV 2026-04 unverdicted novelty 5.0

    UniME combines a pretrained unified ViT encoder with modality-specific CNN encoders to improve brain tumor segmentation performance when some MRI modalities are missing.

  21. MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation

    cs.CV 2026-04 unverdicted novelty 5.0

    MambaLiteUNet integrates Mamba into U-Net with adaptive fusion, local-global mixing, and cross-gated attention modules to reach 87.12% IoU and 93.09% Dice on skin lesion datasets while cutting parameters by 93.6%.

  22. A Mamba-Based Multimodal Network for Multiscale Blast-Induced Rapid Structural Damage Assessment

    cs.AI 2026-04 unverdicted novelty 5.0

    A new Mamba multimodal network integrates multi-scale blast-loading information with satellite images to improve rapid structural damage assessment after explosions, showing gains over prior methods on the Beirut 2020 case.

  23. Efficient Spatial-Temporal Focal Adapter with SSM for Temporal Action Detection

    cs.CV 2026-04 unverdicted novelty 5.0

    A new adapter module combining boundary-aware state space modeling with spatial processing boosts localization and robustness in temporal action detection.

  24. Beyond ZOH: Advanced Discretization Strategies for Vision Mamba

    cs.CV 2026-04 unverdicted novelty 4.0

    Bilinear discretization improves Vision Mamba accuracy over zero-order hold on classification, segmentation, and detection benchmarks with only modest extra training cost.

  25. ConvVitMamba: Efficient Multiscale Convolution, Transformer, and Mamba-Based Sequence modelling for Hyperspectral Image Classification

    cs.CV 2026-04 unverdicted novelty 4.0

    ConvVitMamba integrates multiscale convolution, transformer encoding, and Mamba-based refinement with PCA to outperform prior CNN, ViT, and Mamba methods in accuracy, size, and speed on four HSI benchmark datasets.

  26. Beyond Mamba: Enhancing State-space Models with Deformable Dilated Convolutions for Multi-scale Traffic Object Detection

    cs.CV 2026-04 unverdicted novelty 4.0

    MDDCNet combines Mamba blocks with deformable dilated convolutions, enhanced feed-forward networks, and an attention-aggregating feature pyramid to achieve better multi-scale traffic object detection than prior detectors.

  27. Attention Is not Everything: Efficient Alternatives for Vision

    cs.CV 2026-04 unverdicted novelty 3.0

    A survey that taxonomizes non-Transformer vision models and evaluates their practical trade-offs across efficiency, scalability, and robustness.

  28. A Hybrid Architecture for Benign-Malignant Classification of Mammography ROIs

    cs.CV 2026-04 unverdicted novelty 3.0

    Hybrid EfficientNetV2-M and Vision Mamba architecture achieves strong binary classification performance on abnormality-centered mammography ROIs from CBIS-DDSM.

Reference graph

Works this paper leans on

80 extracted references · 80 canonical work pages · cited by 28 Pith papers · 9 internal anchors

  1. [1]

    Beit: BERT pre-training of image transformers

    Bao, H., Dong, L., Piao, S., and Wei, F. Beit: BERT pre-training of image transformers. In ICLR, 2022. URL https://openreview.net/forum?id=p-BhZSz59o4

  2. [2]

    2-d ssm: A general spatial layer for visual transformers

    Baron, E., Zimerman, I., and Wolf, L. 2-d ssm: A general spatial layer for visual transformers. arXiv preprint arXiv:2306.06635, 2023

  3. [3]

    Introducing our multimodal models

    Bavishi, R., Elsen, E., Hawthorne, C., Nye, M., Odena, A., Somani, A., and Taşırlar, S. Introducing our multimodal models, 2023. URL https://www.adept.ai/blog/fuyu-8b

  4. [4]

    Cascade r-cnn: High quality object detection and instance segmentation

    Cai, Z. and Vasconcelos, N. Cascade r-cnn: High quality object detection and instance segmentation. TPAMI, 2019

  5. [5]

    Emerging properties in self-supervised vision transformers

    Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A. Emerging properties in self-supervised vision transformers. In ICCV, 2021

  6. [6]

    Generating Long Sequences with Sparse Transformers

    Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019

  7. [7]

    Rethinking attention with performers

    Choromanski, K. M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J. Q., Mohiuddin, A., Kaiser, L., Belanger, D. B., Colwell, L. J., and Weller, A. Rethinking attention with performers. In ICLR, 2021. URL https://openreview.net/forum?id=Ua6zuk0WRH

  8. [8]

    Coatnet: Marrying convolution and attention for all data sizes

    Dai, Z., Liu, H., Le, Q. V., and Tan, M. Coatnet: Marrying convolution and attention for all data sizes. NeurIPS, 34, 2021

  9. [9]

    Imagenet: A large-scale hierarchical image database

    Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In CVPR, 2009

  10. [10]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  11. [11]

    LongNet: Scaling transformers to 1,000,000,000 tokens

    Ding, J., Ma, S., Dong, L., Zhang, X., Huang, S., Wang, W., Zheng, N., and Wei, F. Longnet: Scaling transformers to 1,000,000,000 tokens. arXiv preprint arXiv:2307.02486, 2023

  12. [12]

    Scaling up your kernels to 31x31: Revisiting large kernel design in cnns

    Ding, X., Zhang, X., Han, J., and Ding, G. Scaling up your kernels to 31x31: Revisiting large kernel design in cnns. In CVPR, 2022

  13. [13]

    Cswin transformer: A general vision transformer backbone with cross-shaped windows

    Dong, X., Bao, J., Chen, D., Zhang, W., Yu, N., Yuan, L., Chen, D., and Guo, B. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In CVPR, 2022

  14. [14]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020

  15. [15]

    Convit: Improving vision transformers with soft convolutional inductive biases

    d’Ascoli, S., Touvron, H., Leavitt, M. L., Morcos, A. S., Biroli, G., and Sagun, L. Convit: Improving vision transformers with soft convolutional inductive biases. In ICML, 2021

  16. [16]

    Msg-transformer: Exchanging local spatial information by manipulating messenger tokens

    Fang, J., Xie, L., Wang, X., Zhang, X., Liu, W., and Tian, Q. Msg-transformer: Exchanging local spatial information by manipulating messenger tokens. In CVPR, 2022

  17. [17]

    Eva: Exploring the limits of masked visual representation learning at scale

    Fang, Y., Wang, W., Xie, B., Sun, Q., Wu, L., Wang, X., Huang, T., Wang, X., and Cao, Y. Eva: Exploring the limits of masked visual representation learning at scale. In CVPR, 2023

  18. [18]

    Hungry hungry hippos: Towards language modeling with state space models

    Fu, D. Y., Dao, T., Saab, K. K., Thomas, A. W., Rudra, A., and Re, C. Hungry hungry hippos: Towards language modeling with state space models. In ICLR, 2023. URL https://openreview.net/forum?id=COZDy0WYGg

  19. [19]

    Simple copy-paste is a strong data augmentation method for instance segmentation

    Ghiasi, G., Cui, Y., Srinivas, A., Qian, R., Lin, T.-Y., Cubuk, E. D., Le, Q. V., and Zoph, B. Simple copy-paste is a strong data augmentation method for instance segmentation. In CVPR, 2021

  20. [20]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

  21. [21]

    Efficiently Modeling Long Sequences with Structured State Spaces

    Gu, A., Goel, K., and Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021 a

  22. [22]

    Combining recurrent, convolutional, and continuous-time models with linear state space layers

    Gu, A., Johnson, I., Goel, K., Saab, K., Dao, T., Rudra, A., and Ré, C. Combining recurrent, convolutional, and continuous-time models with linear state space layers. In NeurIPS, 2021 b

  23. [23]

    On the parameterization and initialization of diagonal state space models

    Gu, A., Goel, K., Gupta, A., and Ré, C. On the parameterization and initialization of diagonal state space models. In NeurIPS, 2022

  24. [24]

    Diagonal state spaces are as effective as structured state spaces

    Gupta, A., Gu, A., and Berant, J. Diagonal state spaces are as effective as structured state spaces. In NeurIPS, 2022

  25. [25]

    Deep residual learning for image recognition

    He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016

  26. [26]

    Densely connected convolutional networks

    Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In CVPR, 2017

  27. [27]

    Long movie clip classification with state-space video models

    Islam, M. M. and Bertasius, G. Long movie clip classification with state-space video models. In ECCV, 2022

  28. [28]

    Efficient movie scene detection using state-space transformers

    Islam, M. M., Hasan, M., Athrey, K. S., Braskich, T., and Bertasius, G. Efficient movie scene detection using state-space transformers. In CVPR, 2023

  29. [29]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q., Sung, Y.-H., Li, Z., and Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021

  30. [30]

    A new approach to linear filtering and prediction problems

    Kalman, R. E. A new approach to linear filtering and prediction problems. 1960

  31. [31]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Kenton, J. D. M.-W. C. and Toutanova, L. K. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019

  32. [32]

    Reformer: The efficient transformer

    Kitaev, N., Kaiser, L., and Levskaya, A. Reformer: The efficient transformer. In ICLR, 2020. URL https://openreview.net/forum?id=rkgNKkHtvB

  33. [33]

    Imagenet classification with deep convolutional neural networks

    Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In NeurIPS, 2012

  34. [34]

    Gradient-based learning applied to document recognition

    LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11): 2278–2324, 1998

  35. [35]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Li, J., Li, D., Xiong, C., and Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022 a

  36. [36]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Li, J., Li, D., Savarese, S., and Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023

  37. [37]

    What makes convolutional models great on long sequence modeling?

    Li, Y., Cai, T., Zhang, Y., Chen, D., and Dey, D. What makes convolutional models great on long sequence modeling? In ICLR, 2022 b

  38. [38]

    Exploring plain vision transformer backbones for object detection

    Li, Y., Mao, H., Girshick, R., and He, K. Exploring plain vision transformer backbones for object detection. In ECCV, 2022 c

  39. [39]

    Microsoft coco: Common objects in context

    Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In ECCV, 2014

  40. [40]

    Visual instruction tuning

    Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023

  41. [41]

    More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity

    Liu, S., Chen, T., Chen, X., Chen, X., Xiao, Q., Wu, B., Kärkkäinen, T., Pechenizkiy, M., Mocanu, D., and Wang, Z. More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity. arXiv preprint arXiv:2207.03620, 2022 a

  42. [42]

    Vmamba: Visual state space model

    Liu, Y., Tian, Y., Zhao, Y., Yu, H., Xie, L., Wang, Y., Ye, Q., and Liu, Y. Vmamba: Visual state space model. arXiv preprint arXiv:2401.10166, 2024

  43. [43]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021

  44. [44]

    A convnet for the 2020s

    Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., and Xie, S. A convnet for the 2020s. In CVPR, 2022 b

  45. [45]

    Decoupled weight decay regularization

    Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In ICLR, 2019

  46. [46]

    U-mamba: Enhancing long-range dependency for biomedical image segmentation

    Ma, J., Li, F., and Wang, B. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722, 2024

  47. [47]

    Long range language modeling via gated state spaces

    Mehta, H., Gupta, A., Cutkosky, A., and Neyshabur, B. Long range language modeling via gated state spaces. In ICLR, 2023. URL https://openreview.net/forum?id=5MkYIYCbva

  48. [48]

    S4nd: Modeling images and videos as multidimensional signals with state spaces

    Nguyen, E., Goel, K., Gu, A., Downs, G., Shah, P., Dao, T., Baccus, S., and Ré, C. S4nd: Modeling images and videos as multidimensional signals with state spaces. In NeurIPS, 2022

  49. [49]

    Hierarchically gated recurrent neural network for sequence modeling

    Qin, Z., Yang, S., and Zhong, Y. Hierarchically gated recurrent neural network for sequence modeling. In NeurIPS, 2023. URL https://openreview.net/forum?id=P1TCHxJwLB

  50. [50]

    Learning transferable visual models from natural language supervision

    Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In ICML, 2021

  51. [51]

    Designing network design spaces

    Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., and Dollár, P. Designing network design spaces. In CVPR, 2020

  52. [52]

    Global filter networks for image classification

    Rao, Y., Zhao, W., Zhu, Z., Lu, J., and Zhou, J. Global filter networks for image classification. Advances in Neural Information Processing Systems, 34: 980–993, 2021

  53. [53]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014

  54. [54]

    Convolutional state space models for long-range spatiotemporal modeling

    Smith, J. T., De Mello, S., Kautz, J., Linderman, S., and Byeon, W. Convolutional state space models for long-range spatiotemporal modeling. In NeurIPS, 2023 a

  55. [55]

    Simplified state space layers for sequence modeling

    Smith, J. T., Warrington, A., and Linderman, S. Simplified state space layers for sequence modeling. In ICLR, 2023 b. URL https://openreview.net/forum?id=Ai8Hw3AXqks

  56. [56]

    Segmenter: Transformer for semantic segmentation

    Strudel, R., Garcia, R., Laptev, I., and Schmid, C. Segmenter: Transformer for semantic segmentation. In ICCV, 2021

  57. [57]

    Retentive Network: A Successor to Transformer for Large Language Models

    Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., and Wei, F. Retentive network: A successor to Transformer for large language models. arXiv preprint arXiv:2307.08621, 2023

  58. [58]

    Going deeper with convolutions

    Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In CVPR, 2015

  59. [59]

    Efficientnet: Rethinking model scaling for convolutional neural networks

    Tan, M. and Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, 2019

  60. [60]

    Efficientnetv2: Smaller models and faster training

    Tan, M. and Le, Q. Efficientnetv2: Smaller models and faster training. In ICML, 2021

  61. [61]

    Mlp-mixer: An all-mlp architecture for vision

    Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., Yung, J., Steiner, A., Keysers, D., Uszkoreit, J., et al. Mlp-mixer: An all-mlp architecture for vision. In NeurIPS, 2021

  62. [62]

    Training data-efficient image transformers & distillation through attention

    Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. Training data-efficient image transformers & distillation through attention. In ICML, 2021 a

  63. [63]

    Training data-efficient image transformers & distillation through attention

    Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. Training data-efficient image transformers & distillation through attention. In ICML, 2021 b

  64. [64]

    Resmlp: Feedforward networks for image classification with data-efficient training

    Touvron, H., Bojanowski, P., Caron, M., Cord, M., El-Nouby, A., Grave, E., Izacard, G., Joulin, A., Synnaeve, G., Verbeek, J., et al. Resmlp: Feedforward networks for image classification with data-efficient training. TPAMI, 2022

  65. [65]

    Deep high-resolution representation learning for visual recognition

    Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., et al. Deep high-resolution representation learning for visual recognition. TPAMI, 2020 a

  66. [66]

    Pretraining without attention

    Wang, J., Yan, J. N., Gu, A., and Rush, A. M. Pretraining without attention. arXiv preprint arXiv:2212.10544, 2022

  67. [67]

    Selective structured state-spaces for long-form video understanding

    Wang, J., Zhu, W., Wang, P., Yu, X., Liu, L., Omar, M., and Hamid, R. Selective structured state-spaces for long-form video understanding. In CVPR, 2023 a

  68. [68]

    Linformer: Self-Attention with Linear Complexity

    Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020 b

  69. [69]

    Pyramid vision transformer: A versatile backbone for dense prediction without convolutions

    Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., and Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, 2021

  70. [70]

    Internimage: Exploring large-scale vision foundation models with deformable convolutions

    Wang, W., Dai, J., Chen, Z., Huang, Z., Li, Z., Zhu, X., Hu, X., Lu, T., Lu, L., Li, H., et al. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In CVPR, 2023 b

  71. [71]

    When an image is worth 1,024 x 1,024 words: A case study in computational pathology

    Wang, W., Ma, S., Xu, H., Usuyama, N., Ding, J., Poon, H., and Wei, F. When an image is worth 1,024 x 1,024 words: A case study in computational pathology. arXiv preprint arXiv:2312.03558, 2023 c

  72. [72]

    Cvt: Introducing convolutions to vision transformers

    Wu, H., Xiao, B., Codella, N., Liu, M., Dai, X., Yuan, L., and Zhang, L. Cvt: Introducing convolutions to vision transformers. In ICCV, 2021

  73. [73]

    Unified perceptual parsing for scene understanding

    Xiao, T., Liu, Y., Zhou, B., Jiang, Y., and Sun, J. Unified perceptual parsing for scene understanding. In ECCV, 2018 a

  74. [74]

    Unified perceptual parsing for scene understanding

    Xiao, T., Liu, Y., Zhou, B., Jiang, Y., and Sun, J. Unified perceptual parsing for scene understanding. In ECCV, 2018 b

  75. [75]

    Aggregated residual transformations for deep neural networks

    Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. In CVPR, 2017

  76. [76]

    Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation

    Xing, Z., Ye, T., Yang, Y., Liu, G., and Zhu, L. Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation. arXiv preprint arXiv:2401.13560, 2024

  77. [77]

    Diffusion models without attention

    Yan, J. N., Gu, J., and Rush, A. M. Diffusion models without attention. arXiv preprint arXiv:2311.18257, 2023

  78. [78]

    Focal self-attention for local-global interactions in vision transformers

    Yang, J., Li, C., Zhang, P., Dai, X., Xiao, B., Yuan, L., and Gao, J. Focal self-attention for local-global interactions in vision transformers. arXiv preprint arXiv:2107.00641, 2021

  79. [79]

    Metaformer is actually what you need for vision

    Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., and Yan, S. Metaformer is actually what you need for vision. In CVPR, pp. 10819–10829, 2022

  80. [80]

    Semantic understanding of scenes through the ade20k dataset

    Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., and Torralba, A. Semantic understanding of scenes through the ade20k dataset. IJCV, 2019