Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
Pith reviewed 2026-05-11 21:29 UTC · model grok-4.3
The pith
A vision backbone built on bidirectional Mamba blocks outperforms DeiT transformers in accuracy and efficiency on image tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the reliance on self-attention for visual representation learning is not necessary. The authors propose Vim, a generic vision backbone with bidirectional Mamba blocks that marks image sequences with position embeddings and compresses the visual representation using bidirectional state space models. This achieves higher performance than DeiT on standard benchmarks while offering significant improvements in computation and memory efficiency.
What carries the argument
Bidirectional Mamba blocks, which process position-embedded image patch sequences with state space models to model visual data without attention layers.
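As a concrete illustration of that mechanism, the sketch below (not the authors' released code) shows how an image could be turned into a position-marked patch sequence. The patch size, embedding width, and the placement of a class token in the middle of the sequence are illustrative assumptions; the paper discusses several class-token placements.

```python
# Minimal sketch, assuming DeiT-style patchification; not the released Vim code.
import torch
import torch.nn as nn

class PatchSequence(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=192):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Non-overlapping patch projection, as in ViT/DeiT backbones.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Learned position embeddings "mark" the flattened sequence.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                                    # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)     # (B, N, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        mid = tokens.shape[1] // 2
        # A middle class-token placement is assumed here for illustration.
        tokens = torch.cat([tokens[:, :mid], cls, tokens[:, mid:]], dim=1)
        return tokens + self.pos_embed                       # (B, N + 1, dim)
```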
If this is right
- Vim achieves higher performance than DeiT on ImageNet classification.
- Vim shows better results on COCO object detection and ADE20K semantic segmentation.
- Vim is 2.8 times faster than DeiT and saves 86.8 percent of GPU memory for batch inference on 1248 by 1248 resolution images.
- The approach removes computation and memory constraints for high-resolution visual understanding.
Where Pith is reading between the lines
- If successful, similar bidirectional state space models could be applied to video processing where sequence lengths are even longer.
- This might encourage development of hybrid architectures that combine Mamba blocks with other efficient components for vision.
- The efficiency gains could allow training of larger vision models on limited hardware resources.
Load-bearing premise
Bidirectional state space models equipped with position embeddings can fully capture the position-sensitive aspects and global context needs of visual data without relying on self-attention.
What would settle it
A controlled experiment in which Vim trained under the same conditions as DeiT achieves lower accuracy on ImageNet or loses its efficiency advantage when extracting features from high-resolution images.
Original abstract
Recently the state space models (SSMs) with efficient hardware-aware designs, i.e., the Mamba deep learning model, have shown great potential for long sequence modeling. Meanwhile building efficient and generic vision backbones purely upon SSMs is an appealing direction. However, representing visual data is challenging for SSMs due to the position-sensitivity of visual data and the requirement of global context for visual understanding. In this paper, we show that the reliance on self-attention for visual representation learning is not necessary and propose a new generic vision backbone with bidirectional Mamba blocks (Vim), which marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models. On ImageNet classification, COCO object detection, and ADE20k semantic segmentation tasks, Vim achieves higher performance compared to well-established vision transformers like DeiT, while also demonstrating significantly improved computation & memory efficiency. For example, Vim is 2.8× faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on images with a resolution of 1248×1248. The results demonstrate that Vim is capable of overcoming the computation & memory constraints on performing Transformer-style understanding for high-resolution images and it has great potential to be the next-generation backbone for vision foundation models. Code is available at https://github.com/hustvl/Vim.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Vision Mamba (Vim), a vision backbone that replaces self-attention with bidirectional Mamba blocks (selective state space models) applied to flattened image patch sequences augmented by position embeddings. It claims this architecture achieves higher accuracy than DeiT on ImageNet classification, COCO detection, and ADE20K segmentation while offering substantial efficiency gains, such as 2.8× faster inference and 86.8% lower GPU memory usage on high-resolution images.
Significance. If the results hold under rigorous verification, the work provides evidence that SSM-based models can serve as efficient, generic alternatives to vision transformers, with particular promise for high-resolution tasks where attention's quadratic cost is prohibitive. The public code release is a clear strength that supports reproducibility and future extensions.
major comments (3)
- [Experiments] Experimental Setup and Results sections: Training protocols, optimizer settings, data augmentations, and number of runs are not fully specified for the ImageNet, COCO, and ADE20K benchmarks. This omission prevents independent verification of the reported gains over DeiT and weakens support for the central claim that self-attention is unnecessary.
- [Method] §3 (Method, bidirectional Mamba blocks): The architecture flattens 2D patches into a 1D sequence, applies forward and backward selective SSM scans, and relies on learned positional encodings to restore position sensitivity (this flatten-scan-merge wiring is sketched just after these major comments). No ablation isolates the contribution of bidirectionality versus position embeddings, nor tests whether this indirect mechanism suffices for the 2D neighborhood and diagonal relations that self-attention models explicitly.
- [Experiments] Results tables (e.g., ImageNet and downstream tasks): Performance deltas are presented without statistical significance tests, standard deviations across seeds, or controls for training compute. This makes it difficult to assess whether the efficiency-accuracy trade-off truly demonstrates that attention can be dispensed with.
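To make the wiring in the second major comment concrete, the following is a deliberately simplified sketch: a plain diagonal linear recurrence stands in for Mamba's hardware-aware selective scan (gating, convolution, and input-dependent parameters are omitted), and the forward and backward outputs are merged by summation, which is one plausible choice rather than a claim about the released implementation. It also bears on the merging question raised in the minor comments below.

```python
# Simplified sketch of bidirectional scanning; not Mamba's selective scan.
import torch
import torch.nn as nn

def ssm_scan(x, log_a, B, C):
    """Per-channel recurrence h_t = a * h_{t-1} + B * x_t, y_t = C * h_t."""
    batch, length, dim = x.shape
    a = torch.exp(-torch.exp(log_a))            # decay kept in (0, 1) for stability
    h = x.new_zeros(batch, dim)
    ys = []
    for t in range(length):
        h = a * h + B * x[:, t]
        ys.append(C * h)
    return torch.stack(ys, dim=1)               # (batch, length, dim)

class BidirectionalSSMBlock(nn.Module):
    def __init__(self, dim=192):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.in_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # Separate decay parameters for the forward and backward directions.
        self.log_a_fwd = nn.Parameter(torch.zeros(dim))
        self.log_a_bwd = nn.Parameter(torch.zeros(dim))
        self.B = nn.Parameter(torch.ones(dim))
        self.C = nn.Parameter(torch.ones(dim))

    def forward(self, tokens):                  # tokens: (batch, seq, dim)
        x = self.in_proj(self.norm(tokens))
        fwd = ssm_scan(x, self.log_a_fwd, self.B, self.C)
        bwd = ssm_scan(x.flip(1), self.log_a_bwd, self.B, self.C).flip(1)
        # Merge directions by summation; the residual keeps blocks stackable.
        return tokens + self.out_proj(fwd + bwd)
```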
minor comments (2)
- [Abstract] Abstract: The efficiency numbers (2.8× speed, 86.8% memory) are given for batch inference on 1248×1248 images; clarify whether these measurements include the full forward pass or only feature extraction (see the measurement sketch after these minor comments).
- [Method] Figure 1 and architecture diagrams: The visualization of forward/backward scan paths on 2D patches would benefit from explicit annotation of how hidden states are merged across directions.
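On the first minor comment, here is a hedged sketch of how the peak-memory and throughput figures could be measured for batch inference at 1248×1248. The batch size, warm-up count, and model handle are assumptions, and whether a classification head runs during "feature extraction" is exactly the ambiguity the comment asks the authors to resolve.

```python
# Hedged benchmarking sketch; the paper's own script may differ in details.
import time
import torch

@torch.no_grad()
def benchmark(model, batch=8, resolution=1248, warmup=5, iters=20):
    device = torch.device("cuda")
    model = model.to(device).eval()
    x = torch.randn(batch, 3, resolution, resolution, device=device)
    for _ in range(warmup):                     # warm up kernels and allocator
        model(x)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return {
        "images_per_second": batch * iters / elapsed,
        "peak_memory_gb": torch.cuda.max_memory_allocated() / 1024 ** 3,
    }
```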
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment point-by-point below and commit to revisions that improve reproducibility and analysis without altering the core claims.
Point-by-point responses
-
Referee: [Experiments] Experimental Setup and Results sections: Training protocols, optimizer settings, data augmentations, and number of runs are not fully specified for the ImageNet, COCO, and ADE20K benchmarks. This omission prevents independent verification of the reported gains over DeiT and weakens support for the central claim that self-attention is unnecessary.
Authors: We agree that fuller documentation aids verification. The training protocols follow standard practices for DeiT comparisons and are implemented in the released code, but we will expand the Experimental Setup section in the revision to explicitly list optimizer settings (AdamW, lr=5e-4, weight decay 0.05, cosine decay), data augmentations (RandAugment, Mixup, CutMix as in DeiT), and number of runs (mean and std over 3 seeds); a hedged sketch of this recipe appears after these responses. This directly addresses the concern for independent reproduction. revision: yes
-
Referee: [Method] §3 (Method, bidirectional Mamba blocks): The architecture flattens 2D patches into a 1D sequence, applies forward and backward selective SSM scans, and relies on learned positional encodings to restore position sensitivity. No ablation isolates the contribution of bidirectionality versus position embeddings, nor tests whether this indirect mechanism suffices for 2D neighborhood and diagonal relations that self-attention models explicitly.
Authors: We acknowledge the value of isolating these factors. The manuscript motivates bidirectionality for capturing global context in both directions and positional embeddings for spatial awareness, but does not include a dedicated ablation. In the revision we will add an ablation study comparing bidirectional vs. unidirectional Mamba blocks and the effect of removing positional embeddings. This will clarify the contribution to modeling 2D relations and strengthen the methodological justification. revision: yes
-
Referee: [Experiments] Results tables (e.g., ImageNet and downstream tasks): Performance deltas are presented without statistical significance tests, standard deviations across seeds, or controls for training compute. This makes it difficult to assess whether the efficiency-accuracy trade-off truly demonstrates that attention can be dispensed with.
Authors: We accept this criticism. The reported gains are consistent across three diverse tasks, but we did not include std devs or formal tests in the tables. In the revised manuscript we will add standard deviations from multiple seeds to the main results tables and include a brief discussion of statistical significance and training compute controls (all models trained under matched protocols). This will better support the efficiency claims. revision: yes
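As a companion to the first response, here is a minimal sketch of the quoted DeiT-style recipe using only the hyperparameters listed there (AdamW, lr 5e-4, weight decay 0.05, cosine decay, three seeds). The epoch count, label smoothing, and the augmentation pipeline (RandAugment, Mixup, CutMix inside the dataloader) are assumed to follow the standard DeiT setup rather than taken from the rebuttal.

```python
# Hedged training-recipe sketch; dataloader and augmentations assumed DeiT-style.
import torch

def train_one_seed(model, train_loader, epochs=300, seed=0):
    torch.manual_seed(seed)                     # one of the 3 reported seeds
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.05)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    # 300 epochs and 0.1 label smoothing are DeiT defaults assumed here, not quoted above.
    criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
    for _ in range(epochs):
        for images, targets in train_loader:    # batches already augmented
            optimizer.zero_grad()
            loss = criterion(model(images), targets)
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model

# Mean and standard deviation over seeds, as promised for the revised tables
# (make_model, evaluate, and loader are hypothetical helpers):
# accs = [evaluate(train_one_seed(make_model(), loader, seed=s)) for s in range(3)]
```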
Circularity Check
No significant circularity in empirical architecture proposal
full rationale
The paper proposes Vim as a vision backbone replacing self-attention with bidirectional Mamba blocks plus position embeddings, then validates via direct benchmark comparisons (ImageNet, COCO, ADE20K) against DeiT and other baselines. No equations, predictions, or first-principles results reduce to inputs by construction; the architecture is an explicit design choice whose sufficiency is tested externally and falsifiably. No self-citations are load-bearing for the central claim, and no fitted parameters are relabeled as predictions. This is a standard empirical contribution with independent content.
Axiom & Free-Parameter Ledger
free parameters (1)
- model hyperparameters and architectural choices in Vim
axioms (1)
- domain assumption: Bidirectional state space models can represent visual data adequately when augmented with position embeddings
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced · unclear · "we show that the reliance on self-attention for visual representation learning is not necessary and propose a new generic vision backbone with bidirectional Mamba blocks (Vim), which marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models"
-
IndisputableMonolith.Foundation.DimensionForcing · dimension_forced · unclear · "Vim is 2.8× faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on images with a resolution of 1248×1248"
Forward citations
Cited by 28 Pith papers
-
Selection, Not Fusion: Radar-Modulated State Space Models for Radar-Camera Depth Estimation
Radar-Modulated Selection perturbs only the step size Δ and readout C parameters inside Mamba's selective scan with radar data while keeping other components image-only, yielding state-of-the-art depth estimation on n...
-
TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles
TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.
-
Rethink MAE with Linear Time-Invariant Dynamics
Token order in frozen visual representations is exploitable via SSM-based LTI probes, revealing pre-training-dependent heterogeneity that fixed pooling misses.
-
KAConvNet: Kolmogorov-Arnold Convolutional Networks for Vision Recognition
KAConvNet introduces a Kolmogorov-Arnold Convolutional Layer to build networks competitive with ViTs and CNNs while offering stronger theoretical interpretability.
-
GraphLeap: Decoupling Graph Construction and Convolution for Vision GNN Acceleration on FPGA
GraphLeap decouples per-layer graph construction from feature updates in Vision GNNs by using previous-layer features for the current graph, enabling pipelined FPGA acceleration with up to 95.7× CPU speedup after fine-tuning.
-
RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation
RSGMamba introduces a reliability-aware self-gated Mamba block for dynamic cross-modal feature selection in semantic segmentation, delivering state-of-the-art mIoU on RGB-D and RGB-T benchmarks with 48.6M parameters.
-
WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects
WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.
-
EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction
EmambaIR is a visual state space model with cross-modal top-k sparse attention and gated SSM components that outperforms prior CNN and ViT methods on event-guided deblurring, deraining, and HDR reconstruction while re...
-
GEM: Generating LiDAR World Model via Deformable Mamba
GEM is a new LiDAR world model using deformable Mamba that disentangles dynamic and static features to generate high-fidelity simulations and achieve state-of-the-art results on autonomous driving benchmarks.
-
BVI-Mamba: Video Enhancement Using a Visual State-Space Model for Low-Light and Underwater Environments
BVI-Mamba enhances low-light and underwater videos by combining feature alignment with a UNet architecture built from Visual State Space blocks, claiming better quality and efficiency than prior Transformer or convolu...
-
MambaBack: Bridging Local Features and Global Contexts in Whole Slide Image Analysis
MambaBack is a hybrid Mamba-CNN model with Hilbert sampling and chunked inference that reports better performance than seven prior methods on five whole-slide image datasets.
-
HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet
HAMSA achieves 85.7% ImageNet-1K top-1 accuracy as a spectral-domain SSM with 2.2x faster inference and lower memory than transformers or scanning-based SSMs.
-
Event-Adaptive State Transition and Gated Fusion for RGB-Event Object Tracking
MambaTrack improves RGB-Event object tracking via event-adaptive state transitions in a Dynamic State Space Model and a Gated Projection Fusion module, reporting state-of-the-art results on FE108 and FELT datasets.
-
PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer
PoM is a new linear-complexity token mixer using learned polynomials that matches attention performance in transformers while enabling efficient long-sequence processing.
-
YOLOv12: Attention-Centric Real-Time Object Detectors
YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.
-
ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs
ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.
-
SAMIC: A Lightweight Semantic-Aware Mamba for Efficient Perceptual Image Compression
SAMIC introduces semantic-aware Mamba blocks and SVD-based redundancy reduction to achieve efficient perceptual image compression with improved rate-distortion-perception tradeoffs.
-
TopoMamba: Topology-Aware Scanning and Fusion for Segmenting Heterogeneous Medical Visual Media
TopoMamba improves medical image segmentation by combining topology-aware diagonal scans with standard cross-scans and a HSIC Gate for efficient fusion, yielding gains on thin and curved targets like the pancreas.
-
Breaking the Resource Wall: Geometry-Guided Sequence Modeling for Efficient Semantic Segmentation
DGM-Net reaches 82.3% mIoU on Cityscapes and 45.24% on ADE20K using directional geometric guidance inside a linear-complexity Mamba backbone, without heavy pretraining or large models.
-
Uni-Encoder Meets Multi-Encoders: Representation Before Fusion for Brain Tumor Segmentation with Missing Modalities
UniME combines a pretrained unified ViT encoder with modality-specific CNN encoders to improve brain tumor segmentation performance when some MRI modalities are missing.
-
MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation
MambaLiteUNet integrates Mamba into U-Net with adaptive fusion, local-global mixing, and cross-gated attention modules to reach 87.12% IoU and 93.09% Dice on skin lesion datasets while cutting parameters by 93.6%.
-
A Mamba-Based Multimodal Network for Multiscale Blast-Induced Rapid Structural Damage Assessment
A new Mamba multimodal network integrates multi-scale blast-loading information with satellite images to improve rapid structural damage assessment after explosions, showing gains over prior methods on the Beirut 2020 case.
-
Efficient Spatial-Temporal Focal Adapter with SSM for Temporal Action Detection
A new adapter module combining boundary-aware state space modeling with spatial processing boosts localization and robustness in temporal action detection.
-
Beyond ZOH: Advanced Discretization Strategies for Vision Mamba
Bilinear discretization improves Vision Mamba accuracy over zero-order hold on classification, segmentation, and detection benchmarks with only modest extra training cost.
-
ConvVitMamba: Efficient Multiscale Convolution, Transformer, and Mamba-Based Sequence modelling for Hyperspectral Image Classification
ConvVitMamba integrates multiscale convolution, transformer encoding, and Mamba-based refinement with PCA to outperform prior CNN, ViT, and Mamba methods in accuracy, size, and speed on four HSI benchmark datasets.
-
Beyond Mamba: Enhancing State-space Models with Deformable Dilated Convolutions for Multi-scale Traffic Object Detection
MDDCNet combines Mamba blocks with deformable dilated convolutions, enhanced feed-forward networks, and an attention-aggregating feature pyramid to achieve better multi-scale traffic object detection than prior detectors.
-
Attention Is not Everything: Efficient Alternatives for Vision
A survey that taxonomizes non-Transformer vision models and evaluates their practical trade-offs across efficiency, scalability, and robustness.
-
A Hybrid Architecture for Benign-Malignant Classification of Mammography ROIs
Hybrid EfficientNetV2-M and Vision Mamba architecture achieves strong binary classification performance on abnormality-centered mammography ROIs from CBIS-DDSM.
Reference graph
Works this paper leans on
-
[1]
Beit: BERT pre-training of image transformers
Bao, H., Dong, L., Piao, S., and Wei, F. Beit: BERT pre-training of image transformers. In ICLR, 2022. URL https://openreview.net/forum?id=p-BhZSz59o4
work page 2022
-
[2]
2-d ssm: A general spatial layer for visual transformers
Baron, E., Zimerman, I., and Wolf, L. 2-d ssm: A general spatial layer for visual transformers. arXiv preprint arXiv:2306.06635, 2023
-
[3]
Introducing our multimodal models, 2023
Bavishi, R., Elsen, E., Hawthorne, C., Nye, M., Odena, A., Somani, A., and Taşırlar, S. Introducing our multimodal models, 2023. URL https://www.adept.ai/blog/fuyu-8b
work page 2023
-
[4]
Cai, Z. and Vasconcelos, N. Cascade r-cnn: High quality object detection and instance segmentation. TPAMI, 2019
work page 2019
-
[5]
Emerging properties in self-supervised vision transformers
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A. Emerging properties in self-supervised vision transformers. In ICCV, 2021
work page 2021
-
[6]
Generating Long Sequences with Sparse Transformers
Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019
work page · Pith review · arXiv · 2019
-
[7]
Rethinking attention with performers
Choromanski, K. M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J. Q., Mohiuddin, A., Kaiser, L., Belanger, D. B., Colwell, L. J., and Weller, A. Rethinking attention with performers. In ICLR, 2021. URL https://openreview.net/forum?id=Ua6zuk0WRH
work page 2021
-
[8]
Dai, Z., Liu, H., Le, Q. V., and Tan, M. Coatnet: Marrying convolution and attention for all data sizes. NeurIPS, 34, 2021
work page 2021
-
[9]
Imagenet: A large-scale hierarchical image database
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In CVPR, 2009
work page 2009
-
[10]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018
work page · Pith review · arXiv · 2018
-
[11]
LongNet: Scaling transformers to 1,000,000,000 tokens
Ding, J., Ma, S., Dong, L., Zhang, X., Huang, S., Wang, W., Zheng, N., and Wei, F. Longnet: Scaling transformers to 1,000,000,000 tokens. arXiv preprint arXiv:2307.02486, 2023
-
[12]
Scaling up your kernels to 31x31: Revisiting large kernel design in cnns
Ding, X., Zhang, X., Han, J., and Ding, G. Scaling up your kernels to 31x31: Revisiting large kernel design in cnns. In CVPR, 2022
work page 2022
-
[13]
Cswin transformer: A general vision transformer backbone with cross-shaped windows
Dong, X., Bao, J., Chen, D., Zhang, W., Yu, N., Yuan, L., Chen, D., and Guo, B. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In CVPR, 2022
work page 2022
-
[14]
An image is worth 16x16 words: Transformers for image recognition at scale
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020
work page 2020
-
[15]
d’Ascoli, S., Touvron, H., Leavitt, M. L., Morcos, A. S., Biroli, G., and Sagun, L. Convit: Improving vision transformers with soft convolutional inductive biases. In ICML, 2021
work page 2021
-
[16]
Msg-transformer: Exchanging local spatial information by manipulating messenger tokens
Fang, J., Xie, L., Wang, X., Zhang, X., Liu, W., and Tian, Q. Msg-transformer: Exchanging local spatial information by manipulating messenger tokens. In CVPR, 2022
work page 2022
-
[17]
Eva: Exploring the limits of masked visual representation learning at scale
Fang, Y., Wang, W., Xie, B., Sun, Q., Wu, L., Wang, X., Huang, T., Wang, X., and Cao, Y. Eva: Exploring the limits of masked visual representation learning at scale. In CVPR, 2023
work page 2023
-
[18]
Fu, D. Y., Dao, T., Saab, K. K., Thomas, A. W., Rudra, A., and Re, C. Hungry hungry hippos: Towards language modeling with state space models. In ICLR, 2023. URL https://openreview.net/forum?id=COZDy0WYGg
work page 2023
- [19]
-
[20]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023
work page · Pith review · arXiv · 2023
-
[21]
Efficiently Modeling Long Sequences with Structured State Spaces
Gu, A., Goel, K., and Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021 a
work page · Pith review · arXiv · 2021
-
[22]
Combining recurrent, convolutional, and continuous-time models with linear state space layers
Gu, A., Johnson, I., Goel, K., Saab, K., Dao, T., Rudra, A., and Ré, C. Combining recurrent, convolutional, and continuous-time models with linear state space layers. In NeurIPS, 2021 b
work page 2021
-
[23]
On the parameterization and initialization of diagonal state space models
Gu, A., Goel, K., Gupta, A., and Ré, C. On the parameterization and initialization of diagonal state space models. In NeurIPS, 2022
work page 2022
-
[24]
Diagonal state spaces are as effective as structured state spaces
Gupta, A., Gu, A., and Berant, J. Diagonal state spaces are as effective as structured state spaces. In NeurIPS, 2022
work page 2022
-
[25]
Deep residual learning for image recognition
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016
work page 2016
-
[26]
Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In CVPR, 2017
work page 2017
-
[27]
Islam, M. M. and Bertasius, G. Long movie clip classification with state-space video models. In ECCV, 2022
work page 2022
-
[28]
Islam, M. M., Hasan, M., Athrey, K. S., Braskich, T., and Bertasius, G. Efficient movie scene detection using state-space transformers. In CVPR, 2023
work page 2023
-
[29]
Scaling up visual and vision-language representation learning with noisy text supervision
Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q., Sung, Y.-H., Li, Z., and Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021
work page 2021
-
[30]
Kalman, R. E. A new approach to linear filtering and prediction problems. 1960
work page 1960
-
[31]
Kenton, J. D. M.-W. C. and Toutanova, L. K. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019
work page 2019
-
[32]
Reformer: The efficient transformer
Kitaev, N., Kaiser, L., and Levskaya, A. Reformer: The efficient transformer. In ICLR, 2020. URL https://openreview.net/forum?id=rkgNKkHtvB
work page 2020
-
[33]
Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In NeurIPS, 2012
work page 2012
-
[34]
Gradient-based learning applied to document recognition
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11): 2278--2324, 1998
work page 1998
-
[35]
Li, J., Li, D., Xiong, C., and Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022 a
work page 2022
-
[36]
Li, J., Li, D., Savarese, S., and Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023
work page · Pith review · arXiv · 2023
-
[37]
What makes convolutional models great on long sequence modeling?
Li, Y., Cai, T., Zhang, Y., Chen, D., and Dey, D. What makes convolutional models great on long sequence modeling? In ICLR, 2022 b
work page 2022
-
[38]
Exploring plain vision transformer backbones for object detection
Li, Y., Mao, H., Girshick, R., and He, K. Exploring plain vision transformer backbones for object detection. In ECCV, 2022 c
work page 2022
-
[39]
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In ECCV, 2014
work page 2014
-
[40]
Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023
work page · Pith review · arXiv · 2023
-
[41]
More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity
Liu, S., Chen, T., Chen, X., Chen, X., Xiao, Q., Wu, B., Kärkkäinen, T., Pechenizkiy, M., Mocanu, D., and Wang, Z. More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity. arXiv preprint arXiv:2207.03620, 2022 a
-
[42]
Vmamba: Visual state space model
Liu, Y., Tian, Y., Zhao, Y., Yu, H., Xie, L., Wang, Y., Ye, Q., and Liu, Y. Vmamba: Visual state space model. arXiv preprint arXiv:2401.10166, 2024
-
[43]
Swin transformer: Hierarchical vision transformer using shifted windows
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021
work page 2021
-
[44]
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., and Xie, S. A convnet for the 2020s. In CVPR, 2022 b
work page 2022
-
[45]
Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In ICLR, 2019
work page 2019
-
[46]
U-mamba: Enhancing long-range dependency for biomedical image segmentation
Ma, J., Li, F., and Wang, B. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722, 2024
-
[47]
Long range language modeling via gated state spaces
Mehta, H., Gupta, A., Cutkosky, A., and Neyshabur, B. Long range language modeling via gated state spaces. In ICLR, 2023. URL https://openreview.net/forum?id=5MkYIYCbva
work page 2023
-
[48]
S4nd: Modeling images and videos as multidimensional signals with state spaces
Nguyen, E., Goel, K., Gu, A., Downs, G., Shah, P., Dao, T., Baccus, S., and Ré, C. S4nd: Modeling images and videos as multidimensional signals with state spaces. In NeurIPS, 2022
work page 2022
-
[49]
Hierarchically gated recurrent neural network for sequence modeling
Qin, Z., Yang, S., and Zhong, Y. Hierarchically gated recurrent neural network for sequence modeling. In NeurIPS, 2023. URL https://openreview.net/forum?id=P1TCHxJwLB
work page 2023
-
[50]
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In ICML, 2021
work page 2021
-
[51]
Designing network design spaces
Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., and Dollár, P. Designing network design spaces. In CVPR, 2020
work page 2020
-
[52]
Global filter networks for image classification
Rao, Y., Zhao, W., Zhu, Z., Lu, J., and Zhou, J. Global filter networks for image classification. Advances in neural information processing systems, 34: 980--993, 2021
work page 2021
-
[53]
Very Deep Convolutional Networks for Large-Scale Image Recognition
Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014
work page · Pith review · arXiv · 2014
-
[54]
Convolutional state space models for long-range spatiotemporal modeling
Smith, J. T., De Mello, S., Kautz, J., Linderman, S., and Byeon, W. Convolutional state space models for long-range spatiotemporal modeling. In NeurIPS, 2023 a
work page 2023
-
[55]
Simplified state space layers for sequence modeling
Smith, J. T., Warrington, A., and Linderman, S. Simplified state space layers for sequence modeling. In ICLR, 2023 b . URL https://openreview.net/forum?id=Ai8Hw3AXqks
work page 2023
-
[56]
Segmenter: Transformer for semantic segmentation
Strudel, R., Garcia, R., Laptev, I., and Schmid, C. Segmenter: Transformer for semantic segmentation. In ICCV, 2021
work page 2021
-
[57]
Retentive Network: A Successor to Transformer for Large Language Models
Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., and Wei, F. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023
work page · Pith review · arXiv · 2023
-
[58]
Going deeper with convolutions
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In CVPR, 2015
work page 2015
- [59]
- [60]
-
[61]
Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., Yung, J., Steiner, A., Keysers, D., Uszkoreit, J., et al. Mlp-mixer: An all-mlp architecture for vision. In NeurIPS, 2021
work page 2021
-
[62]
Training data-efficient image transformers & distillation through attention
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. Training data-efficient image transformers & distillation through attention. In ICML, 2021 a
work page 2021
-
[63]
Training data-efficient image transformers & distillation through attention
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. Training data-efficient image transformers & distillation through attention. In ICML, 2021 b
work page 2021
-
[64]
Resmlp: Feedforward networks for image classification with data-efficient training
Touvron, H., Bojanowski, P., Caron, M., Cord, M., El-Nouby, A., Grave, E., Izacard, G., Joulin, A., Synnaeve, G., Verbeek, J., et al. Resmlp: Feedforward networks for image classification with data-efficient training. TPAMI, 2022
work page 2022
-
[65]
Deep high-resolution representation learning for visual recognition
Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., et al. Deep high-resolution representation learning for visual recognition. TPAMI, 2020 a
work page 2020
-
[66]
Wang, J., Yan, J. N., Gu, A., and Rush, A. M. Pretraining without attention. arXiv preprint arXiv:2212.10544, 2022
-
[67]
Selective structured state-spaces for long-form video understanding
Wang, J., Zhu, W., Wang, P., Yu, X., Liu, L., Omar, M., and Hamid, R. Selective structured state-spaces for long-form video understanding. In CVPR, 2023 a
work page 2023
-
[68]
Linformer: Self-Attention with Linear Complexity
Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020 b
work page · Pith review · arXiv · 2020
-
[69]
Pyramid vision transformer: A versatile backbone for dense prediction without convolutions
Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., and Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, 2021
work page 2021
-
[70]
Internimage: Exploring large-scale vision foundation models with deformable convolutions
Wang, W., Dai, J., Chen, Z., Huang, Z., Li, Z., Zhu, X., Hu, X., Lu, T., Lu, L., Li, H., et al. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In CVPR, 2023 b
work page 2023
-
[71]
When an image is worth 1,024 x 1,024 words: A case study in computational pathology
Wang, W., Ma, S., Xu, H., Usuyama, N., Ding, J., Poon, H., and Wei, F. When an image is worth 1,024 x 1,024 words: A case study in computational pathology. arXiv preprint arXiv:2312.03558, 2023 c
-
[72]
Cvt: Introducing convolutions to vision transformers
Wu, H., Xiao, B., Codella, N., Liu, M., Dai, X., Yuan, L., and Zhang, L. Cvt: Introducing convolutions to vision transformers. In ICCV, 2021
work page 2021
-
[73]
Unified perceptual parsing for scene understanding
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., and Sun, J. Unified perceptual parsing for scene understanding. In ECCV, 2018 a
work page 2018
-
[74]
Unified perceptual parsing for scene understanding
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., and Sun, J. Unified perceptual parsing for scene understanding. In ECCV, 2018 b
work page 2018
-
[75]
Aggregated residual transformations for deep neural networks
Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. In CVPR, 2017
work page 2017
-
[76]
Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation
Xing, Z., Ye, T., Yang, Y., Liu, G., and Zhu, L. Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation. arXiv preprint arXiv:2401.13560, 2024
-
[77]
Yan, J. N., Gu, J., and Rush, A. M. Diffusion models without attention. arXiv preprint arXiv:2311.18257, 2023
-
[78]
Yang, J., Li, C., Zhang, P., Dai, X., Xiao, B., Yuan, L., and Gao, J. Focal self-attention for local-global interactions in vision transformers. arXiv preprint arXiv:2107.00641, 2021
-
[79]
Metaformer is actually what you need for vision
Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., and Yan, S. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10819--10829, 2022
work page 2022
-
[80]
Semantic understanding of scenes through the ade20k dataset
Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., and Torralba, A. Semantic understanding of scenes through the ade20k dataset. IJCV, 2019
work page 2019