Pith · machine review for the scientific record

arXiv: 2401.09417 · v3 · submitted 2024-01-17 · 💻 cs.CV · cs.LG

Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

Bencheng Liao, Lianghui Zhu, Qian Zhang, Wenyu Liu, Xinggang Wang, Xinlong Wang

Pith reviewed 2026-05-11 21:29 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords vision backbone · state space model · Mamba · image classification · object detection · semantic segmentation · efficient vision model

The pith

A vision backbone built on bidirectional Mamba blocks outperforms DeiT transformers in accuracy and efficiency on image tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that self-attention is not essential for visual representation learning by replacing it with bidirectional state space models in a new backbone called Vim. Vim adds position embeddings to image patch sequences and uses bidirectional Mamba blocks to compress and represent the visual information efficiently. This yields higher performance than DeiT on ImageNet classification, COCO detection, and ADE20K segmentation, with much better speed and memory efficiency, especially at high resolutions. Sympathetic readers would care because it challenges the dominance of attention-based models and offers a path to more efficient vision systems that can handle larger images without excessive resource use.

Core claim

The central claim is that the reliance on self-attention for visual representation learning is not necessary. The authors propose Vim, a generic vision backbone with bidirectional Mamba blocks that marks image sequences with position embeddings and compresses the visual representation using bidirectional state space models. This achieves higher performance than DeiT on standard benchmarks while offering significant improvements in computation and memory efficiency.
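For readers who want the machinery in symbols: a state space layer maps a token sequence through a learned linear recurrence, discretized by zero-order hold. The rendering below follows standard S4/Mamba notation and is our refresher, not text from the paper; in the selective (Mamba) variant, Δ, B, and C are themselves functions of the input token.

```latex
% Discretized SSM recurrence (zero-order hold), standard S4/Mamba notation.
% In Mamba, \Delta, B, C are input-dependent ("selective").
h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t,
\quad\text{where}\quad
\bar{A} = \exp(\Delta A), \qquad
\bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B .
```

Because the recurrence carries a fixed-size state h_t, cost grows linearly in sequence length, which is the source of Vim's efficiency argument.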

What carries the argument

Bidirectional Mamba blocks, which process position-embedded image patch sequences with state space models to model visual data without attention layers.
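A minimal sketch of the shape of such a block follows, using a toy diagonal SSM as a stand-in for Mamba's hardware-aware selective scan. The merge-by-averaging, module names, and hyperparameters are our illustration, not the paper's implementation; Vim's actual block uses gated, convolution-augmented selective scans.

```python
# Toy bidirectional SSM block in the spirit of Vim. Everything here is a
# hedged illustration: the scan is a plain diagonal recurrence, not Mamba's
# selective scan, and averaging is one plausible way to merge directions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySSMScan(nn.Module):
    """Diagonal linear SSM: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    def __init__(self, dim: int):
        super().__init__()
        self.log_a = nn.Parameter(torch.full((dim,), -1.0))
        self.b = nn.Parameter(torch.ones(dim))
        self.c = nn.Parameter(torch.ones(dim))

    def forward(self, x):                       # x: (batch, seq_len, dim)
        a = torch.exp(-F.softplus(self.log_a))  # keep 0 < a < 1 for stability
        h = torch.zeros_like(x[:, 0])
        ys = []
        for t in range(x.shape[1]):             # O(L) scan; Mamba fuses this on GPU
            h = a * h + self.b * x[:, t]
            ys.append(self.c * h)
        return torch.stack(ys, dim=1)

class BidirectionalSSMBlock(nn.Module):
    """Scan the patch sequence in both directions, merge, add residual."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fwd = ToySSMScan(dim)
        self.bwd = ToySSMScan(dim)

    def forward(self, x):                       # x: position-embedded patch tokens
        z = self.norm(x)
        y_fwd = self.fwd(z)
        y_bwd = self.bwd(z.flip(1)).flip(1)     # reverse, scan, restore order
        return x + 0.5 * (y_fwd + y_bwd)        # residual + merged directions

# Usage: a 14x14 grid of patch tokens plus learned position embeddings.
tokens = torch.randn(2, 196, 192)
pos = torch.zeros(1, 196, 192, requires_grad=True)
out = BidirectionalSSMBlock(192)(tokens + pos)
print(out.shape)                                # torch.Size([2, 196, 192])
```

The flip-scan-flip pattern is what gives the block access to context on both sides of every patch, which a single causal scan lacks.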

If this is right

  • Vim achieves higher performance than DeiT on ImageNet classification.
  • Vim shows better results on COCO object detection and ADE20K semantic segmentation.
  • Vim is 2.8× faster than DeiT and saves 86.8% of GPU memory for batch inference on 1248×1248 images (a back-of-envelope cost comparison follows this list).
  • The approach lifts the computation and memory constraints on Transformer-style understanding of high-resolution images.
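A back-of-envelope comparison (our arithmetic, assuming the 16×16 patching standard for DeiT-style backbones) makes the resolution scaling concrete:

```python
# Illustrative token-count arithmetic, not a measurement from the paper.
# Self-attention stores ~N^2 pairwise scores; an SSM carries a fixed-size
# state per channel, so its memory grows ~N.
patch = 16
for res in (224, 1248):
    n = (res // patch) ** 2  # number of patch tokens
    print(f"{res}x{res}: {n} tokens, ~{n * n:,} attention entries vs ~{n:,} SSM steps")
# 224x224:   196 tokens -> ~38,416 attention entries
# 1248x1248: 6084 tokens -> ~37,015,056 attention entries
```

At 1248×1248 the quadratic term is roughly a thousand times larger than at 224×224, which is consistent in direction with the reported savings, though the exact 2.8× and 86.8% figures depend on implementation details.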

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If successful, similar bidirectional state space models could be applied to video processing where sequence lengths are even longer.
  • This might encourage development of hybrid architectures that combine Mamba blocks with other efficient components for vision.
  • The efficiency gains could allow training of larger vision models on limited hardware resources.

Load-bearing premise

Bidirectional state space models equipped with position embeddings can fully capture the position-sensitive aspects and global context needs of visual data without relying on self-attention.

What would settle it

A controlled experiment in which Vim, trained under the same conditions as DeiT, achieves lower accuracy on ImageNet or loses its efficiency advantage when extracting features from high-resolution images.

Original abstract

Recently the state space models (SSMs) with efficient hardware-aware designs, i.e., the Mamba deep learning model, have shown great potential for long sequence modeling. Meanwhile building efficient and generic vision backbones purely upon SSMs is an appealing direction. However, representing visual data is challenging for SSMs due to the position-sensitivity of visual data and the requirement of global context for visual understanding. In this paper, we show that the reliance on self-attention for visual representation learning is not necessary and propose a new generic vision backbone with bidirectional Mamba blocks (Vim), which marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models. On ImageNet classification, COCO object detection, and ADE20k semantic segmentation tasks, Vim achieves higher performance compared to well-established vision transformers like DeiT, while also demonstrating significantly improved computation & memory efficiency. For example, Vim is 2.8× faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on images with a resolution of 1248×1248. The results demonstrate that Vim is capable of overcoming the computation & memory constraints on performing Transformer-style understanding for high-resolution images and it has great potential to be the next-generation backbone for vision foundation models. Code is available at https://github.com/hustvl/Vim.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom & free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Vision Mamba (Vim), a vision backbone that replaces self-attention with bidirectional Mamba blocks (selective state space models) applied to flattened image patch sequences augmented by position embeddings. It claims this architecture achieves higher accuracy than DeiT on ImageNet classification, COCO detection, and ADE20K segmentation while offering substantial efficiency gains, such as 2.8× faster inference and 86.8% lower GPU memory usage on high-resolution images.

Significance. If the results hold under rigorous verification, the work provides evidence that SSM-based models can serve as efficient, generic alternatives to vision transformers, with particular promise for high-resolution tasks where attention's quadratic cost is prohibitive. The public code release is a clear strength that supports reproducibility and future extensions.

major comments (3)
  1. [Experiments] Experimental Setup and Results sections: Training protocols, optimizer settings, data augmentations, and number of runs are not fully specified for the ImageNet, COCO, and ADE20K benchmarks. This omission prevents independent verification of the reported gains over DeiT and weakens support for the central claim that self-attention is unnecessary.
  2. [Method] §3 (Method, bidirectional Mamba blocks): The architecture flattens 2D patches into a 1D sequence, applies forward and backward selective SSM scans, and relies on learned positional encodings to restore position sensitivity. No ablation isolates the contribution of bidirectionality versus position embeddings, nor tests whether this indirect mechanism suffices for 2D neighborhood and diagonal relations that self-attention models explicitly.
  3. [Experiments] Results tables (e.g., ImageNet and downstream tasks): Performance deltas are presented without statistical significance tests, standard deviations across seeds, or controls for training compute. This makes it difficult to assess whether the efficiency-accuracy trade-off truly demonstrates that attention can be dispensed with.
minor comments (2)
  1. [Abstract] Abstract: The efficiency numbers (2.8× speed, 86.8% memory) are given for batch inference on 1248×1248 images; clarify whether these measurements include the full forward pass or only feature extraction (an unambiguous measurement harness is sketched after this list).
  2. [Method] Figure 1 and architecture diagrams: The visualization of forward/backward scan paths on 2D patches would benefit from explicit annotation of how hidden states are merged across directions.
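To make the measurement ambiguity in minor comment 1 concrete, the sketch below times a full batched forward pass and records peak GPU memory. The function name, batch size, and defaults are our illustration, not the paper's protocol; it assumes a CUDA device.

```python
# Hedged benchmarking sketch: full forward pass, wall-clock time, peak memory.
import time
import torch

def profile_forward(model, resolution=1248, batch=4, device="cuda"):
    model = model.to(device).eval()
    x = torch.randn(batch, 3, resolution, resolution, device=device)
    torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        torch.cuda.synchronize(device)  # exclude queued kernels from timing
        t0 = time.perf_counter()
        model(x)                        # the full forward pass
        torch.cuda.synchronize(device)  # wait for all kernels to finish
        elapsed = time.perf_counter() - t0
    peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    return elapsed, peak_gb
```

Running the same harness over both backbones at matched batch sizes would pin down exactly what the 2.8× and 86.8% figures measure.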

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment point-by-point below and commit to revisions that improve reproducibility and analysis without altering the core claims.

Point-by-point responses
  1. Referee: [Experiments] Experimental Setup and Results sections: Training protocols, optimizer settings, data augmentations, and number of runs are not fully specified for the ImageNet, COCO, and ADE20K benchmarks. This omission prevents independent verification of the reported gains over DeiT and weakens support for the central claim that self-attention is unnecessary.

    Authors: We agree that fuller documentation aids verification. The training protocols follow standard practices for DeiT comparisons and are implemented in the released code, but we will expand the Experimental Setup section in the revision to explicitly list optimizer settings (AdamW, lr=5e-4, weight decay 0.05, cosine decay), data augmentations (RandAugment, Mixup, CutMix as in DeiT), and number of runs (mean and standard deviation over 3 seeds); a minimal sketch of this configuration follows this list. This directly addresses the concern for independent reproduction. revision: yes

  2. Referee: [Method] §3 (Method, bidirectional Mamba blocks): The architecture flattens 2D patches into a 1D sequence, applies forward and backward selective SSM scans, and relies on learned positional encodings to restore position sensitivity. No ablation isolates the contribution of bidirectionality versus position embeddings, nor tests whether this indirect mechanism suffices for 2D neighborhood and diagonal relations that self-attention models explicitly.

    Authors: We acknowledge the value of isolating these factors. The manuscript motivates bidirectionality for capturing global context in both directions and positional embeddings for spatial awareness, but does not include a dedicated ablation. In the revision we will add an ablation study comparing bidirectional vs. unidirectional Mamba blocks and the effect of removing positional embeddings. This will clarify the contribution to modeling 2D relations and strengthen the methodological justification. revision: yes

  3. Referee: [Experiments] Results tables (e.g., ImageNet and downstream tasks): Performance deltas are presented without statistical significance tests, standard deviations across seeds, or controls for training compute. This makes it difficult to assess whether the efficiency-accuracy trade-off truly demonstrates that attention can be dispensed with.

    Authors: We accept this criticism. The reported gains are consistent across three diverse tasks, but we did not include standard deviations or formal tests in the tables. In the revised manuscript we will add standard deviations from multiple seeds to the main results tables and include a brief discussion of statistical significance and training-compute controls (all models trained under matched protocols). This will better support the efficiency claims. revision: yes
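For concreteness, the settings enumerated in response 1 could be assembled as follows. This is a minimal sketch under the rebuttal's stated assumptions (AdamW at lr 5e-4, weight decay 0.05, cosine decay, DeiT-style augmentation); the placeholder model and the specific torchvision calls are our illustration, not the authors' released script.

```python
# Hedged training-config sketch mirroring the rebuttal's enumerated settings.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torchvision import transforms

model = torch.nn.Linear(192, 1000)   # placeholder for a Vim backbone + head
optimizer = AdamW(model.parameters(), lr=5e-4, weight_decay=0.05)
scheduler = CosineAnnealingLR(optimizer, T_max=300)  # cosine decay over 300 epochs

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandAugment(),        # RandAugment, as listed in the rebuttal
    transforms.ToTensor(),
])
# Mixup and CutMix act on batches rather than single images; the DeiT recipe
# typically uses mixup_alpha=0.8 and cutmix_alpha=1.0 (e.g., via timm.data.Mixup).
```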

Circularity Check

0 steps flagged

No significant circularity in empirical architecture proposal

full rationale

The paper proposes Vim as a vision backbone replacing self-attention with bidirectional Mamba blocks plus position embeddings, then validates via direct benchmark comparisons (ImageNet, COCO, ADE20K) against DeiT and other baselines. No equations, predictions, or first-principles results reduce to inputs by construction; the architecture is an explicit design choice whose sufficiency is tested externally and falsifiably. No self-citations are load-bearing for the central claim, and no fitted parameters are relabeled as predictions. This is a standard empirical contribution with independent content.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work is primarily empirical; it introduces no new physical entities or unproven mathematical axioms beyond standard deep-learning assumptions about sequence modeling.

free parameters (1)
  • model hyperparameters and architectural choices in Vim
    Standard tunable elements in neural network design that are fitted or selected to achieve reported performance.
axioms (1)
  • domain assumption: bidirectional state space models can represent visual data adequately when augmented with position embeddings
    Core premise invoked to justify replacing self-attention.

pith-pipeline@v0.9.0 · 5565 in / 1174 out tokens · 48424 ms · 2026-05-11T21:29:46.195528+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Selection, Not Fusion: Radar-Modulated State Space Models for Radar-Camera Depth Estimation

    cs.CV 2026-05 unverdicted novelty 7.0

    Radar-Modulated Selection perturbs only the step size Δ and readout C parameters inside Mamba's selective scan with radar data while keeping other components image-only, yielding state-of-the-art depth estimation on n...

  2. TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles

    cs.CV 2026-05 unverdicted novelty 7.0

    TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.

  3. Rethink MAE with Linear Time-Invariant Dynamics

    cs.CV 2026-04 unverdicted novelty 7.0

    Token order in frozen visual representations is exploitable via SSM-based LTI probes, revealing pre-training-dependent heterogeneity that fixed pooling misses.

  4. KAConvNet: Kolmogorov-Arnold Convolutional Networks for Vision Recognition

    cs.CV 2026-04 unverdicted novelty 7.0

    KAConvNet introduces a Kolmogorov-Arnold Convolutional Layer to build networks competitive with ViTs and CNNs while offering stronger theoretical interpretability.

  5. GraphLeap: Decoupling Graph Construction and Convolution for Vision GNN Acceleration on FPGA

    cs.CV 2026-04 conditional novelty 7.0

    GraphLeap decouples per-layer graph construction from feature updates in Vision GNNs by using previous-layer features for the current graph, enabling pipelined FPGA acceleration with up to 95.7× CPU speedup after fine-tuning.

  6. RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation

    cs.CV 2026-04 unverdicted novelty 7.0

    RSGMamba introduces a reliability-aware self-gated Mamba block for dynamic cross-modal feature selection in semantic segmentation, delivering state-of-the-art mIoU on RGB-D and RGB-T benchmarks with 48.6M parameters.

  7. WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects

    cs.CV 2026-04 unverdicted novelty 7.0

    WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.

  8. EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction

    cs.CV 2026-05 unverdicted novelty 6.0

    EmambaIR is a visual state space model with cross-modal top-k sparse attention and gated SSM components that outperforms prior CNN and ViT methods on event-guided deblurring, deraining, and HDR reconstruction while re...

  9. GEM: Generating LiDAR World Model via Deformable Mamba

    cs.CV 2026-05 unverdicted novelty 6.0

    GEM is a new LiDAR world model using deformable Mamba that disentangles dynamic and static features to generate high-fidelity simulations and achieve state-of-the-art results on autonomous driving benchmarks.

  10. BVI-Mamba: Video Enhancement Using a Visual State-Space Model for Low-Light and Underwater Environments

    cs.CV 2026-04 unverdicted novelty 6.0

    BVI-Mamba enhances low-light and underwater videos by combining feature alignment with a UNet architecture built from Visual State Space blocks, claiming better quality and efficiency than prior Transformer or convolu...

  11. MambaBack: Bridging Local Features and Global Contexts in Whole Slide Image Analysis

    cs.CV 2026-04 conditional novelty 6.0

    MambaBack is a hybrid Mamba-CNN model with Hilbert sampling and chunked inference that reports better performance than seven prior methods on five whole-slide image datasets.

  12. HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet

    cs.CV 2026-04 unverdicted novelty 6.0

    HAMSA achieves 85.7% ImageNet-1K top-1 accuracy as a spectral-domain SSM with 2.2x faster inference and lower memory than transformers or scanning-based SSMs.

  13. Event-Adaptive State Transition and Gated Fusion for RGB-Event Object Tracking

    cs.CV 2026-04 unverdicted novelty 6.0

    MambaTrack improves RGB-Event object tracking via event-adaptive state transitions in a Dynamic State Space Model and a Gated Projection Fusion module, reporting state-of-the-art results on FE108 and FELT datasets.

  14. PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

    cs.CV 2026-04 unverdicted novelty 6.0

    PoM is a new linear-complexity token mixer using learned polynomials that matches attention performance in transformers while enabling efficient long-sequence processing.

  15. YOLOv12: Attention-Centric Real-Time Object Detectors

    cs.CV 2025-02 unverdicted novelty 6.0

    YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.

  16. ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs

    cs.CV 2026-05 unverdicted novelty 5.0

    ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.

  17. SAMIC: A Lightweight Semantic-Aware Mamba for Efficient Perceptual Image Compression

    cs.CV 2026-05 unverdicted novelty 5.0

    SAMIC introduces semantic-aware Mamba blocks and SVD-based redundancy reduction to achieve efficient perceptual image compression with improved rate-distortion-perception tradeoffs.

  18. TopoMamba: Topology-Aware Scanning and Fusion for Segmenting Heterogeneous Medical Visual Media

    cs.CV 2026-04 unverdicted novelty 5.0

    TopoMamba improves medical image segmentation by combining topology-aware diagonal scans with standard cross-scans and a HSIC Gate for efficient fusion, yielding gains on thin and curved targets like the pancreas.

  19. Breaking the Resource Wall: Geometry-Guided Sequence Modeling for Efficient Semantic Segmentation

    cs.CV 2026-04 unverdicted novelty 5.0

    DGM-Net reaches 82.3% mIoU on Cityscapes and 45.24% on ADE20K using directional geometric guidance inside a linear-complexity Mamba backbone, without heavy pretraining or large models.

  20. Uni-Encoder Meets Multi-Encoders: Representation Before Fusion for Brain Tumor Segmentation with Missing Modalities

    cs.CV 2026-04 unverdicted novelty 5.0

    UniME combines a pretrained unified ViT encoder with modality-specific CNN encoders to improve brain tumor segmentation performance when some MRI modalities are missing.

  21. MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation

    cs.CV 2026-04 unverdicted novelty 5.0

    MambaLiteUNet integrates Mamba into U-Net with adaptive fusion, local-global mixing, and cross-gated attention modules to reach 87.12% IoU and 93.09% Dice on skin lesion datasets while cutting parameters by 93.6%.

  22. A Mamba-Based Multimodal Network for Multiscale Blast-Induced Rapid Structural Damage Assessment

    cs.AI 2026-04 unverdicted novelty 5.0

    A new Mamba multimodal network integrates multi-scale blast-loading information with satellite images to improve rapid structural damage assessment after explosions, showing gains over prior methods on the Beirut 2020 case.

  23. Efficient Spatial-Temporal Focal Adapter with SSM for Temporal Action Detection

    cs.CV 2026-04 unverdicted novelty 5.0

    A new adapter module combining boundary-aware state space modeling with spatial processing boosts localization and robustness in temporal action detection.

  24. Beyond ZOH: Advanced Discretization Strategies for Vision Mamba

    cs.CV 2026-04 unverdicted novelty 4.0

    Bilinear discretization improves Vision Mamba accuracy over zero-order hold on classification, segmentation, and detection benchmarks with only modest extra training cost.

  25. ConvVitMamba: Efficient Multiscale Convolution, Transformer, and Mamba-Based Sequence modelling for Hyperspectral Image Classification

    cs.CV 2026-04 unverdicted novelty 4.0

    ConvVitMamba integrates multiscale convolution, transformer encoding, and Mamba-based refinement with PCA to outperform prior CNN, ViT, and Mamba methods in accuracy, size, and speed on four HSI benchmark datasets.

  26. Beyond Mamba: Enhancing State-space Models with Deformable Dilated Convolutions for Multi-scale Traffic Object Detection

    cs.CV 2026-04 unverdicted novelty 4.0

    MDDCNet combines Mamba blocks with deformable dilated convolutions, enhanced feed-forward networks, and an attention-aggregating feature pyramid to achieve better multi-scale traffic object detection than prior detectors.

  27. Attention Is not Everything: Efficient Alternatives for Vision

    cs.CV 2026-04 unverdicted novelty 3.0

    A survey that taxonomizes non-Transformer vision models and evaluates their practical trade-offs across efficiency, scalability, and robustness.

  28. A Hybrid Architecture for Benign-Malignant Classification of Mammography ROIs

    cs.CV 2026-04 unverdicted novelty 3.0

    Hybrid EfficientNetV2-M and Vision Mamba architecture achieves strong binary classification performance on abnormality-centered mammography ROIs from CBIS-DDSM.

Reference graph

Works this paper leans on

80 extracted references · 80 canonical work pages · cited by 28 Pith papers · 9 internal anchors

  1. [1]

    Beit: BERT pre-training of image transformers

    Bao, H., Dong, L., Piao, S., and Wei, F. Beit: BERT pre-training of image transformers. In ICLR, 2022. URL https://openreview.net/forum?id=p-BhZSz59o4

  2. [2]

    2-d ssm: A general spatial layer for visual transformers

    Baron, E., Zimerman, I., and Wolf, L. 2-d ssm: A general spatial layer for visual transformers. arXiv preprint arXiv:2306.06635, 2023

  3. [3]

    Introducing our multimodal models

    Bavishi, R., Elsen, E., Hawthorne, C., Nye, M., Odena, A., Somani, A., and Taşırlar, S. Introducing our multimodal models, 2023. URL https://www.adept.ai/blog/fuyu-8b

  4. [4]

    Cascade r-cnn: High quality object detection and instance segmentation

    Cai, Z. and Vasconcelos, N. Cascade r-cnn: High quality object detection and instance segmentation. TPAMI, 2019

  5. [5]

    Emerging properties in self-supervised vision transformers

    Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A. Emerging properties in self-supervised vision transformers. In ICCV, 2021

  6. [6]

    Generating Long Sequences with Sparse Transformers

    Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019

  7. [7]

    Rethinking attention with performers

    Choromanski, K. M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J. Q., Mohiuddin, A., Kaiser, L., Belanger, D. B., Colwell, L. J., and Weller, A. Rethinking attention with performers. In ICLR, 2021. URL https://openreview.net/forum?id=Ua6zuk0WRH

  8. [8]

    Coatnet: Marrying convolution and attention for all data sizes

    Dai, Z., Liu, H., Le, Q. V., and Tan, M. Coatnet: Marrying convolution and attention for all data sizes. NeurIPS, 34, 2021

  9. [9]

    Imagenet: A large-scale hierarchical image database

    Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In CVPR, 2009

  10. [10]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  11. [11]

    LongNet: Scaling transformers to 1,000,000,000 tokens

    Ding, J., Ma, S., Dong, L., Zhang, X., Huang, S., Wang, W., Zheng, N., and Wei, F. Longnet: Scaling transformers to 1,000,000,000 tokens. arXiv preprint arXiv:2307.02486, 2023

  12. [12]

    Scaling up your kernels to 31x31: Revisiting large kernel design in cnns

    Ding, X., Zhang, X., Han, J., and Ding, G. Scaling up your kernels to 31x31: Revisiting large kernel design in cnns. In CVPR, 2022

  13. [13]

    Cswin transformer: A general vision transformer backbone with cross-shaped windows

    Dong, X., Bao, J., Chen, D., Zhang, W., Yu, N., Yuan, L., Chen, D., and Guo, B. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In CVPR, 2022

  14. [14]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020

  15. [15]

    Convit: Improving vision transformers with soft convolutional inductive biases

    d’Ascoli, S., Touvron, H., Leavitt, M. L., Morcos, A. S., Biroli, G., and Sagun, L. Convit: Improving vision transformers with soft convolutional inductive biases. In ICML, 2021

  16. [16]

    Msg-transformer: Exchanging local spatial information by manipulating messenger tokens

    Fang, J., Xie, L., Wang, X., Zhang, X., Liu, W., and Tian, Q. Msg-transformer: Exchanging local spatial information by manipulating messenger tokens. In CVPR, 2022

  17. [17]

    Eva: Exploring the limits of masked visual representation learning at scale

    Fang, Y., Wang, W., Xie, B., Sun, Q., Wu, L., Wang, X., Huang, T., Wang, X., and Cao, Y. Eva: Exploring the limits of masked visual representation learning at scale. In CVPR, 2023

  18. [18]

    Hungry hungry hippos: Towards language modeling with state space models

    Fu, D. Y., Dao, T., Saab, K. K., Thomas, A. W., Rudra, A., and Re, C. Hungry hungry hippos: Towards language modeling with state space models. In ICLR, 2023. URL https://openreview.net/forum?id=COZDy0WYGg

  19. [19]

    Simple copy-paste is a strong data augmentation method for instance segmentation

    Ghiasi, G., Cui, Y., Srinivas, A., Qian, R., Lin, T.-Y., Cubuk, E. D., Le, Q. V., and Zoph, B. Simple copy-paste is a strong data augmentation method for instance segmentation. In CVPR, 2021

  20. [20]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

  21. [21]

    Efficiently Modeling Long Sequences with Structured State Spaces

    Gu, A., Goel, K., and Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021 a

  22. [22]

    Combining recurrent, convolutional, and continuous-time models with linear state space layers

    Gu, A., Johnson, I., Goel, K., Saab, K., Dao, T., Rudra, A., and Ré, C. Combining recurrent, convolutional, and continuous-time models with linear state space layers. In NeurIPS, 2021 b

  23. [23]

    On the parameterization and initialization of diagonal state space models

    Gu, A., Goel, K., Gupta, A., and Ré, C. On the parameterization and initialization of diagonal state space models. In NeurIPS, 2022

  24. [24]

    Diagonal state spaces are as effective as structured state spaces

    Gupta, A., Gu, A., and Berant, J. Diagonal state spaces are as effective as structured state spaces. In NeurIPS, 2022

  25. [25]

    Deep residual learning for image recognition

    He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016

  26. [26]

    Densely connected convolutional networks

    Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In CVPR, 2017

  27. [27]

    Long movie clip classification with state-space video models

    Islam, M. M. and Bertasius, G. Long movie clip classification with state-space video models. In ECCV, 2022

  28. [28]

    Efficient movie scene detection using state-space transformers

    Islam, M. M., Hasan, M., Athrey, K. S., Braskich, T., and Bertasius, G. Efficient movie scene detection using state-space transformers. In CVPR, 2023

  29. [29]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q., Sung, Y.-H., Li, Z., and Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021

  30. [30]

    A new approach to linear filtering and prediction problems

    Kalman, R. E. A new approach to linear filtering and prediction problems. 1960

  31. [31]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Kenton, J. D. M.-W. C. and Toutanova, L. K. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019

  32. [32]

    Reformer: The efficient transformer

    Kitaev, N., Kaiser, L., and Levskaya, A. Reformer: The efficient transformer. In ICLR, 2020. URL https://openreview.net/forum?id=rkgNKkHtvB

  33. [33]

    Imagenet classification with deep convolutional neural networks

    Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In NeurIPS, 2012

  34. [34]

    Gradient-based learning applied to document recognition

    LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11): 2278–2324, 1998

  35. [35]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Li, J., Li, D., Xiong, C., and Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022 a

  36. [36]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Li, J., Li, D., Savarese, S., and Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023

  37. [37]

    What makes convolutional models great on long sequence modeling?

    Li, Y., Cai, T., Zhang, Y., Chen, D., and Dey, D. What makes convolutional models great on long sequence modeling? In ICLR, 2022 b

  38. [38]

    Exploring plain vision transformer backbones for object detection

    Li, Y., Mao, H., Girshick, R., and He, K. Exploring plain vision transformer backbones for object detection. In ECCV, 2022 c

  39. [39]

    Microsoft coco: Common objects in context

    Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In ECCV, 2014

  40. [40]

    Visual instruction tuning

    Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023

  41. [41]

    More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity

    Liu, S., Chen, T., Chen, X., Chen, X., Xiao, Q., Wu, B., Kärkkäinen, T., Pechenizkiy, M., Mocanu, D., and Wang, Z. More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity. arXiv preprint arXiv:2207.03620, 2022 a

  42. [42]

    Vmamba: Visual state space model

    Liu, Y., Tian, Y., Zhao, Y., Yu, H., Xie, L., Wang, Y., Ye, Q., and Liu, Y. Vmamba: Visual state space model. arXiv preprint arXiv:2401.10166, 2024

  43. [43]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021

  44. [44]

    A convnet for the 2020s

    Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., and Xie, S. A convnet for the 2020s. In CVPR, 2022 b

  45. [45]

    Decoupled weight decay regularization

    Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In ICLR, 2019

  46. [46]

    U-mamba: Enhancing long-range dependency for biomedical image segmentation

    Ma, J., Li, F., and Wang, B. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722, 2024

  47. [47]

    Long range language modeling via gated state spaces

    Mehta, H., Gupta, A., Cutkosky, A., and Neyshabur, B. Long range language modeling via gated state spaces. In ICLR, 2023. URL https://openreview.net/forum?id=5MkYIYCbva

  48. [48]

    S4nd: Modeling images and videos as multidimensional signals with state spaces

    Nguyen, E., Goel, K., Gu, A., Downs, G., Shah, P., Dao, T., Baccus, S., and Ré, C. S4nd: Modeling images and videos as multidimensional signals with state spaces. In NeurIPS, 2022

  49. [49]

    Hierarchically gated recurrent neural network for sequence modeling

    Qin, Z., Yang, S., and Zhong, Y. Hierarchically gated recurrent neural network for sequence modeling. In NeurIPS, 2023. URL https://openreview.net/forum?id=P1TCHxJwLB

  50. [50]

    Learning transferable visual models from natural language supervision

    Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In ICML, 2021

  51. [51]

    Designing network design spaces

    Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., and Dollár, P. Designing network design spaces. In CVPR, 2020

  52. [52]

    Global filter networks for image classification

    Rao, Y., Zhao, W., Zhu, Z., Lu, J., and Zhou, J. Global filter networks for image classification. Advances in Neural Information Processing Systems, 34: 980–993, 2021

  53. [53]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014

  54. [54]

    Convolutional state space models for long-range spatiotemporal modeling

    Smith, J. T., De Mello, S., Kautz, J., Linderman, S., and Byeon, W. Convolutional state space models for long-range spatiotemporal modeling. In NeurIPS, 2023 a

  55. [55]

    Simplified state space layers for sequence modeling

    Smith, J. T., Warrington, A., and Linderman, S. Simplified state space layers for sequence modeling. In ICLR, 2023 b. URL https://openreview.net/forum?id=Ai8Hw3AXqks

  56. [56]

    Segmenter: Transformer for semantic segmentation

    Strudel, R., Garcia, R., Laptev, I., and Schmid, C. Segmenter: Transformer for semantic segmentation. In ICCV, 2021

  57. [57]

    Retentive Network: A Successor to Transformer for Large Language Models

    Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., and Wei, F. Retentive network: A successor to Transformer for large language models. arXiv preprint arXiv:2307.08621, 2023

  58. [58]

    Going deeper with convolutions

    Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In CVPR, 2015

  59. [59]

    Efficientnet: Rethinking model scaling for convolutional neural networks

    Tan, M. and Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, 2019

  60. [60]

    Efficientnetv2: Smaller models and faster training

    Tan, M. and Le, Q. Efficientnetv2: Smaller models and faster training. In ICML, 2021

  61. [61]

    Mlp-mixer: An all-mlp architecture for vision

    Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., Yung, J., Steiner, A., Keysers, D., Uszkoreit, J., et al. Mlp-mixer: An all-mlp architecture for vision. In NeurIPS, 2021

  62. [62]

    Training data-efficient image transformers & distillation through attention

    Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. Training data-efficient image transformers & distillation through attention. In ICML, 2021 a

  63. [63]

    Training data-efficient image transformers & distillation through attention

    Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. Training data-efficient image transformers & distillation through attention. In ICML, 2021 b

  64. [64]

    Resmlp: Feedforward networks for image classification with data-efficient training

    Touvron, H., Bojanowski, P., Caron, M., Cord, M., El-Nouby, A., Grave, E., Izacard, G., Joulin, A., Synnaeve, G., Verbeek, J., et al. Resmlp: Feedforward networks for image classification with data-efficient training. TPAMI, 2022

  65. [65]

    Deep high-resolution representation learning for visual recognition

    Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., et al. Deep high-resolution representation learning for visual recognition. TPAMI, 2020 a

  66. [66]

    Pretraining without attention

    Wang, J., Yan, J. N., Gu, A., and Rush, A. M. Pretraining without attention. arXiv preprint arXiv:2212.10544, 2022

  67. [67]

    Selective structured state-spaces for long-form video understanding

    Wang, J., Zhu, W., Wang, P., Yu, X., Liu, L., Omar, M., and Hamid, R. Selective structured state-spaces for long-form video understanding. In CVPR, 2023 a

  68. [68]

    Linformer: Self-Attention with Linear Complexity

    Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020 b

  69. [69]

    Pyramid vision transformer: A versatile backbone for dense prediction without convolutions

    Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., and Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, 2021

  70. [70]

    Internimage: Exploring large-scale vision foundation models with deformable convolutions

    Wang, W., Dai, J., Chen, Z., Huang, Z., Li, Z., Zhu, X., Hu, X., Lu, T., Lu, L., Li, H., et al. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In CVPR, 2023 b

  71. [71]

    When an image is worth 1,024 x 1,024 words: A case study in computational pathology

    Wang, W., Ma, S., Xu, H., Usuyama, N., Ding, J., Poon, H., and Wei, F. When an image is worth 1,024 x 1,024 words: A case study in computational pathology. arXiv preprint arXiv:2312.03558, 2023 c

  72. [72]

    Cvt: Introducing convolutions to vision transformers

    Wu, H., Xiao, B., Codella, N., Liu, M., Dai, X., Yuan, L., and Zhang, L. Cvt: Introducing convolutions to vision transformers. In ICCV, 2021

  73. [73]

    Unified perceptual parsing for scene understanding

    Xiao, T., Liu, Y., Zhou, B., Jiang, Y., and Sun, J. Unified perceptual parsing for scene understanding. In ECCV, 2018 a

  74. [74]

    Unified perceptual parsing for scene understanding

    Xiao, T., Liu, Y., Zhou, B., Jiang, Y., and Sun, J. Unified perceptual parsing for scene understanding. In ECCV, 2018 b

  75. [75]

    Aggregated residual transformations for deep neural networks

    Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. In CVPR, 2017

  76. [76]

    Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation

    Xing, Z., Ye, T., Yang, Y., Liu, G., and Zhu, L. Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation. arXiv preprint arXiv:2401.13560, 2024

  77. [77]

    Diffusion models without attention

    Yan, J. N., Gu, J., and Rush, A. M. Diffusion models without attention. arXiv preprint arXiv:2311.18257, 2023

  78. [78]

    Focal self-attention for local-global interactions in vision transformers

    Yang, J., Li, C., Zhang, P., Dai, X., Xiao, B., Yuan, L., and Gao, J. Focal self-attention for local-global interactions in vision transformers. arXiv preprint arXiv:2107.00641, 2021

  79. [79]

    Metaformer is actually what you need for vision

    Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., and Yan, S. Metaformer is actually what you need for vision. In CVPR, pp. 10819–10829, 2022

  80. [80]

    Semantic understanding of scenes through the ade20k dataset

    Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., and Torralba, A. Semantic understanding of scenes through the ade20k dataset. IJCV, 2019