pith. machine review for the scientific record.

arxiv: 2605.11300 · v1 · submitted 2026-05-11 · 💻 cs.CV

Recognition: 1 theorem link

· Lean Theorem

Can Graphs Help Vision SSMs See Better?

Anvitha Ramachandran, Dhruv Parikh, Haoyang Fan, Mustafa Munir, Rajgopal Kannan, Viktor Prasanna

Pith reviewed 2026-05-13 06:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords GraphScan · vision SSM · Mamba · dynamic scanning · graph message passing · semantic routing · object detection · image segmentation

The pith

GraphScan adds local graph message passing to boost vision state space models on classification and segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that vision state space models can perform better by inserting an explicit local semantic routing step using graphs before their global selective scanning. Existing scan methods rely on fixed geometric paths or coordinate offsets, but GraphScan builds a spatially limited graph around each token, computes affinities from features plus relative position, and applies one round of message passing to mix information only within learned semantic neighborhoods. The locally updated tokens then enter the standard SSM for long-range aggregation. If the claims hold, accuracy improves on image classification, object detection, instance segmentation, and semantic segmentation while token count and linear scaling in image size are preserved. Readers would care because it reframes the scan operator as learned local grounding rather than pure serialization.

Core claim

We introduce GraphScan, a graph-induced dynamic scanning operator for Vision SSMs. For each token, GraphScan constructs a spatially bounded local graph, learns feature-conditioned affinities with relative positional bias, and produces the output token by one-step message passing over its semantic neighborhood. The resulting tokens are locally grounded before being processed by the selective SSM for global aggregation. Integrated into a hierarchical backbone, GraphScan-Mamba achieves state-of-the-art performance among Vision SSMs across image classification, object detection, instance segmentation, and semantic segmentation, with modest computational overhead. The operator also induces interpretable displacement fields over the token lattice, giving a semantically and spatially grounded view of dynamic scanning.

What carries the argument

GraphScan: a graph-induced dynamic scanning operator that builds spatially bounded local graphs, learns feature-conditioned affinities with positional bias, and performs one-step message passing to ground tokens semantically before selective SSM processing.
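The operator as described can be sketched in a few lines. The following is a minimal NumPy rendering of what the abstract states (bounded neighborhood, feature-conditioned affinities with a relative positional bias, one-step softmax-weighted aggregation), not the authors' implementation; the names `graphscan_route`, `Wq`, `Wk`, and `b_rel`, and details such as the exact projections, are illustrative assumptions.

```python
import numpy as np

def graphscan_route(X, Wq, Wk, b_rel, r=1):
    """One-step GraphScan-style message passing (hypothetical sketch).

    X:     (H, W, d) token feature map
    Wq,Wk: (d, d) query/key projections
    b_rel: (2r+1, 2r+1) relative positional bias over discrete offsets
    r:     L-infinity radius of the bounded local graph
    Returns the routed feature map X' with the same shape (token count unchanged).
    """
    H, W, d = X.shape
    Q, K = X @ Wq, X @ Wk
    out = np.zeros_like(X)
    for i in range(H):
        for j in range(W):
            scores, feats = [], []
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        # feature affinity plus learned bias for the offset
                        s = Q[i, j] @ K[ni, nj] / np.sqrt(d) + b_rel[di + r, dj + r]
                        scores.append(s)
                        feats.append(X[ni, nj])
            s = np.array(scores)
            a = np.exp(s - s.max())
            a /= a.sum()                      # softmax over the local neighborhood
            out[i, j] = a @ np.array(feats)   # affinity-weighted aggregation
    return out
```

Because each neighborhood has at most (2r+1)² members regardless of image size, this routing step preserves linear complexity; the routed map would then be serialized and fed to the selective SSM.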

If this is right

  • GraphScan-Mamba sets new accuracy records among vision SSMs on image classification, object detection, instance segmentation, and semantic segmentation.
  • Token count and linear complexity in image size remain unchanged.
  • Coordinate-based interpolation is replaced by feature-driven semantic routing.
  • The method produces interpretable displacement fields over the token lattice that reveal semantic and spatial structure.
  • Future vision SSM designs should combine learned local semantic routing with global state-space mixing.
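The linear-scaling bullet follows directly from the bounded neighborhood: each token scores and aggregates at most (2r+1)² neighbors, so routing costs O(N·(2r+1)²·d) on top of the SSM's O(N) scan. A toy operation count, under this assumed cost model, makes the scaling concrete:

```python
def routing_ops(n_tokens, r, d):
    """Upper bound on multiply-adds for one GraphScan-style routing pass:
    each of n_tokens tokens touches at most (2r+1)^2 neighbors in d dims.
    This cost model is an illustrative assumption, not the paper's analysis."""
    return n_tokens * (2 * r + 1) ** 2 * d

# doubling the token count doubles the work: linear in image size, not quadratic
assert routing_ops(2 * 1024, r=2, d=96) == 2 * routing_ops(1024, r=2, d=96)
```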

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same local-graph routing idea could be tested inside transformer blocks or other sequence models for vision.
  • The induced displacement fields might serve as a diagnostic tool for understanding how models group visual content.
  • Extending the bounded graphs to temporal neighborhoods could improve video processing without quadratic cost.
  • Combining GraphScan with existing dynamic sampling methods might compound gains while keeping overhead low.

Load-bearing premise

The learned feature affinities within each local spatial graph will reliably form semantic neighborhoods that improve the downstream selective SSM without needing task-specific retuning.

What would settle it

If replacing GraphScan with a standard geometric scan in the same backbone yields equal or higher accuracy on the same detection and segmentation benchmarks, the benefit of the graph operator would be refuted.

Figures

Figures reproduced from arXiv: 2605.11300 by Anvitha Ramachandran, Dhruv Parikh, Haoyang Fan, Mustafa Munir, Rajgopal Kannan, Viktor Prasanna.

Figure 1
Figure 1. Comparison of scanning paradigms for Vision SSMs. Raster and continuous scans expose the SSM to a fixed 1D ordering of the 2D lattice; local-window scans restrict it to bounded predefined regions; coordinate/offset scans introduce input dependence by predicting sampling locations and rerouting through geometric displacement. GraphScan instead keeps token slots fixed and constructs a local semantic graph ar… view at source ↗
Figure 2
Figure 2. GraphScan. Each token attends to a bounded local neighborhood, aggregates semantically weighted neighboring features, and produces a routed token $x'_i$. The routed feature map $X'$ is then passed to S6. The affinity score is $s_{ij} = q_i^\top k_j / \sqrt{d} + b_{\mathrm{rel}}(g_j - g_i)$, with $\alpha_{ij} = \exp(s_{ij}) / \sum_{\ell \in S_r(i)} \exp(s_{i\ell})$ (Eq. 2). The learnable relative bias $b_{\mathrm{rel}}$ is defined over discrete offsets inside the local window. Unless otherwi… view at source ↗
Figure 3
Figure 3. GraphScan-Mamba architecture. Top: four-stage hierarchical backbone with an overlapping convolutional stem to 1/4 resolution and stage-wise downsampling, with stage-wise graph radii r = (1, 1, 2, 3) yielding 3×3, 3×3, 5×5, and 7×7 local neighborhoods. Middle: each of Ni blocks combines a depthwise positional branch, a GraphScan-SSM token-mixing branch, and a ConvFFN branch. Bottom: inside GSSM, features pa… view at source ↗
Figure 4
Figure 4. Stage-3 GraphScan attention on two ImageNet val images. For each query patch (yellow box), the 14×14 attention heatmap concentrates on the L∞=2 outer ring; the central 3×3 block is suppressed because the preceding depthwise convolution already covers it. Panels: sports car and Persian cat, each shown as a displacement field and a scan path. view at source ↗
Figure 5
Figure 5. Stage-4 effective routing. Per-patch displacement field (HSV by direction) and the polyline visiting displaced positions in raster order (plasma colormap; start •, end ■). Both converge onto the discriminative object content. view at source ↗
read the original abstract

Vision state space models inherit the efficiency and long-range modeling ability of Mamba-style selective scans. However, their performance depends critically on the representation of two-dimensional visual features as one-dimensional token sequences. Existing scan operators range from predefined geometric traversals to dynamic coordinate-based samplers that reroute tokens through predicted offsets and interpolation. While effective, these mechanisms primarily adapt paths or sampling locations, rather than explicitly modeling which local patches should exchange information before global state-space mixing. This motivates a simple question: can graphs help vision state space models see better? We introduce GraphScan, a graph-induced dynamic scanning operator for Vision SSMs. For each token, GraphScan constructs a spatially bounded local graph, learns feature-conditioned affinities with relative positional bias, and produces the output token by one-step message passing over its semantic neighborhood. The resulting tokens are locally grounded before being processed by the selective SSM for global aggregation. GraphScan preserves token count and linear scaling in image size, while replacing coordinate-conditioned interpolation with feature-conditioned semantic routing. Integrated into a hierarchical backbone, GraphScan-Mamba achieves state-of-the-art performance among Vision SSMs across image classification, object detection, instance segmentation, and semantic segmentation, with modest computational overhead. Our analysis further shows that GraphScan induces interpretable displacement fields over the token lattice, providing a semantic and spatially grounded view of dynamic scanning. These results suggest that future Vision SSMs should treat scanning not merely as geometric serialization, but as learned local semantic routing before global state-space modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes GraphScan, a graph-induced dynamic scanning operator for Vision SSMs. For each token, it builds a spatially bounded local graph, computes feature-conditioned affinities augmented by relative positional bias, and applies one-step message passing over the resulting semantic neighborhood to produce locally grounded tokens. These tokens are then fed into a standard selective SSM for global aggregation. The resulting GraphScan-Mamba hierarchical backbone is reported to achieve state-of-the-art results among Vision SSMs on image classification, object detection, instance segmentation, and semantic segmentation while incurring only modest computational overhead; the method is also claimed to produce interpretable displacement fields over the token lattice.

Significance. If the empirical claims are substantiated, the work provides a concrete demonstration that explicit local graph-based semantic routing can improve the local grounding of tokens before global state-space mixing in Vision SSMs. This reframes scanning as learned neighborhood construction rather than purely geometric serialization or coordinate-based resampling, which could influence the design of future efficient vision backbones that combine graph and SSM primitives. The preservation of linear complexity and token count is a practical strength.

major comments (3)
  1. [Abstract, §4] Abstract and §4 (Experiments): The central SOTA claim among Vision SSMs is asserted without any quantitative tables, ablation studies, error bars, or detailed baseline comparisons in the provided manuscript. This absence makes it impossible to assess the magnitude of improvement, statistical significance, or whether the gains are attributable to GraphScan rather than other implementation details.
  2. [§3.2] §3.2 (Graph Construction and Message Passing): The paper does not supply a quantitative validation (e.g., edge-class agreement with semantic labels or comparison against low-level edge detectors) that the learned feature-conditioned affinities with relative positional bias produce semantic neighborhoods rather than low-level feature correlations. Without such a check, the one-step message passing could reduce to learned local averaging whose benefit might be replicated by a small convolution at lower cost, undermining the semantic-routing interpretation.
  3. [§3.3] §3.3 (Integration with Selective SSM): No equations or ablation are provided showing how the output of the graph message-passing step is exactly injected into the SSM recurrence (e.g., whether it replaces the input projection, augments the state, or is simply concatenated). This leaves open whether the reported gains require the full GraphScan machinery or could be obtained by a simpler local operator.
minor comments (2)
  1. [§3.1] Notation for the affinity function and relative positional bias is introduced without an explicit equation number or pseudocode listing, making the precise computation of edge weights difficult to reproduce from the text alone.
  2. [§3] The manuscript would benefit from a small diagram or pseudocode block illustrating the full forward pass of one GraphScan layer, including the transition from graph output to SSM input.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We will revise the manuscript to address the concerns raised regarding the presentation of empirical results and technical details of the method.

read point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (Experiments): The central SOTA claim among Vision SSMs is asserted without any quantitative tables, ablation studies, error bars, or detailed baseline comparisons in the provided manuscript. This absence makes it impossible to assess the magnitude of improvement, statistical significance, or whether the gains are attributable to GraphScan rather than other implementation details.

    Authors: We agree that the empirical claims require stronger substantiation through quantitative results. In the revised manuscript, we will expand the abstract to reference key performance metrics and include detailed tables in §4 with comparisons to existing Vision SSMs such as VMamba and others on ImageNet classification, COCO object detection, and segmentation tasks. We will also add ablation studies, report results with error bars from multiple runs, and provide more granular baseline comparisons to demonstrate that the improvements stem from the GraphScan operator. revision: yes

  2. Referee: [§3.2] §3.2 (Graph Construction and Message Passing): The paper does not supply a quantitative validation (e.g., edge-class agreement with semantic labels or comparison against low-level edge detectors) that the learned feature-conditioned affinities with relative positional bias produce semantic neighborhoods rather than low-level feature correlations. Without such a check, the one-step message passing could reduce to learned local averaging whose benefit might be replicated by a small convolution at lower cost, undermining the semantic-routing interpretation.

    Authors: We acknowledge the value of quantitative validation for the semantic interpretation. We will add a new analysis in the revised §3.2, including metrics such as the agreement rate between the learned graph edges and semantic class boundaries derived from ground-truth labels on a held-out set. Additionally, we will compare the learned affinities against those produced by standard low-level edge detectors like Canny or Sobel to highlight the semantic conditioning. This will help distinguish the approach from simple local averaging and support the semantic-routing claim. revision: yes

  3. Referee: [§3.3] §3.3 (Integration with Selective SSM): No equations or ablation are provided showing how the output of the graph message-passing step is exactly injected into the SSM recurrence (e.g., whether it replaces the input projection, augments the state, or is simply concatenated). This leaves open whether the reported gains require the full GraphScan machinery or could be obtained by a simpler local operator.

    Authors: We appreciate this observation on the need for precise technical specification. In the revised §3.3, we will include explicit mathematical formulations showing that the message-passing output directly substitutes for the input features in the selective SSM's input projection. Furthermore, we will incorporate an ablation experiment replacing the graph-based message passing with a standard local convolution to quantify the additional benefit of the semantic graph routing. This will clarify the necessity of the full GraphScan design for achieving the reported performance gains. revision: yes
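The substitution the rebuttal describes, and the small-convolution control the referee asks for, could be set up roughly as follows. Everything here is an illustrative assumption, not the paper's code: `local_avg_baseline` stands in for the "small convolution" ablation, and `ssm_input_tokens` marks the claimed injection point where routed features replace raw features at the SSM's input projection.

```python
import numpy as np

def local_avg_baseline(X, r=1):
    """Ablation control: uniform (2r+1)x(2r+1) local averaging in place of
    learned graph message passing (a crude stand-in for a small depthwise conv)."""
    H, W, d = X.shape
    out = np.empty_like(X)
    for i in range(H):
        for j in range(W):
            patch = X[max(0, i - r):i + r + 1, max(0, j - r):j + r + 1]
            out[i, j] = patch.reshape(-1, d).mean(axis=0)
    return out

def ssm_input_tokens(routed, W_in):
    """Injection point per the rebuttal: the routed map, raster-serialized,
    substitutes for raw features at the selective SSM's input projection."""
    H, W, d = routed.shape
    return routed.reshape(H * W, d) @ W_in  # token count H*W is unchanged
```

The ablation would then train the same backbone twice, once with graph routing and once with `local_avg_baseline`, holding the SSM and training recipe fixed; any remaining accuracy gap isolates the contribution of the learned affinities.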

Circularity Check

0 steps flagged

No circularity: GraphScan is an independent architectural addition whose performance claims rest on empirical benchmarks rather than self-referential definitions or fitted inputs.

full rationale

The paper presents GraphScan as a new module that builds spatially bounded local graphs, computes feature-conditioned affinities with positional bias, and applies one-step message passing before feeding tokens into the selective SSM. No equations or derivations reduce the claimed semantic routing benefit to a fitted parameter renamed as a prediction, nor does any load-bearing step rely on a self-citation chain or uniqueness theorem imported from the authors' prior work. The SOTA results are reported as experimental outcomes on standard vision tasks; the method does not derive its improvements by construction from its own inputs. This is the common case of a self-contained architectural proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim depends on the empirical effectiveness of learned local affinities and one-step message passing; no explicit free parameters, axioms, or invented physical entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5595 in / 1032 out tokens · 105109 ms · 2026-05-13T06:08:36.524242+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 2 internal anchors

  1. [1]

    Relational inductive biases, deep learning, and graph networks

    Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018

  2. [2]

    Vision transformer adapter for dense predictions

    Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. In The Eleventh International Conference on Learning Representations, 2023

  3. [3]

    Twins: Revisiting the design of spatial attention in vision transformers

    Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. Twins: Revisiting the design of spatial attention in vision transformers. Advances in neural information processing systems, 34:9355–9366, 2021

  4. [4]

    Conditional positional encodings for vision transformers

    Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, and Chunhua Shen. Conditional positional encodings for vision transformers. In The Eleventh International Conference on Learning Representations, 2023

  5. [5]

    Coatnet: Marrying convolution and attention for all data sizes

    Zihang Dai, Hanxiao Liu, Quoc V Le, and Mingxing Tan. Coatnet: Marrying convolution and attention for all data sizes. Advances in neural information processing systems, 34:3965–3977, 2021

  6. [6]

    Transformers are ssms: generalized models and efficient algorithms through structured state space duality

    Tri Dao and Albert Gu. Transformers are ssms: generalized models and efficient algorithms through structured state space duality. In Proceedings of the 41st International Conference on Machine Learning. JMLR.org, 2024

  7. [7]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009

  8. [8]

    Unireplknet: A universal perception large-kernel convnet for audio video point cloud time-series and image recognition

    Xiaohan Ding, Yiyuan Zhang, Yixiao Ge, Sijie Zhao, Lin Song, Xiangyu Yue, and Ying Shan. Unireplknet: A universal perception large-kernel convnet for audio video point cloud time-series and image recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5513–5524, 2024

  9. [9]

    Cswin transformer: A general vision transformer backbone with cross-shaped windows

    Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12124–12134, 2022

  10. [10]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  11. [11]

    Hungry hungry hippos: Towards language modeling with state space models

    Daniel Y Fu, Tri Dao, Khaled Kamal Saab, Armin W Thomas, Atri Rudra, and Christopher Re. Hungry hungry hippos: Towards language modeling with state space models. In The Eleventh International Conference on Learning Representations, 2023

  12. [12]

    Neural message passing for quantum chemistry

    Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In International conference on machine learning, pages 1263–1272. PMLR, 2017

  13. [13]

    Mamba: Linear-time sequence modeling with selective state spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In First conference on language modeling, 2024

  14. [14]

    On the parameterization and initialization of diagonal state space models

    Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré. On the parameterization and initialization of diagonal state space models. Advances in neural information processing systems, 35:35971–35983, 2022

  15. [15]

    Efficiently modeling long sequences with structured state spaces

    Albert Gu, Karan Goel, and Christopher Re. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2022

  16. [16]

    Vision gnn: An image is worth graph of nodes

    Kai Han, Yunhe Wang, Jianyuan Guo, Yehui Tang, and Enhua Wu. Vision gnn: An image is worth graph of nodes. Advances in neural information processing systems, 35:8291–8303, 2022

  17. [17]

    Neighborhood attention transformer

    Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi. Neighborhood attention transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6185–6194, 2023

  18. [18]

    Mambavision: A hybrid mamba-transformer vision backbone

    Ali Hatamizadeh and Jan Kautz. Mambavision: A hybrid mamba-transformer vision backbone. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 25261–25270, 2025

  19. [19]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  20. [20]

    Mask r-cnn

    Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017

  21. [21]

    Densely connected convolutional networks

    Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017

  22. [22]

    Localmamba: Visual state space model with windowed selective scan

    Tao Huang, Xiaohuan Pei, Shan You, Fei Wang, Chen Qian, and Chang Xu. Localmamba: Visual state space model with windowed selective scan. In European conference on computer vision, pages 12–22. Springer, 2024

  23. [23]

    A new approach to linear filtering and prediction problems

    R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960

  24. [24]

    Similarity-aware selective state-space modeling for semantic correspondence

    Seungwook Kim and Minsu Cho. Similarity-aware selective state-space modeling for semantic correspondence. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pages 6147–6157, 2025

  25. [25]

    Semi-supervised classification with graph convolutional networks

    Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017

  26. [26]

    Imagenet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012

  27. [27]

    Mamba-3: Improved sequence modeling using state space principles

    Aakash Lahoti, Kevin Li, Berlin Chen, Caitlin Wang, Aviv Bick, J Zico Kolter, Tri Dao, and Albert Gu. Mamba-3: Improved sequence modeling using state space principles. In The Fourteenth International Conference on Learning Representations, 2026

  28. [28]

    Gradient-based learning applied to document recognition

    Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998

  29. [29]

    Mamba-nd: Selective state space modeling for multi-dimensional data

    Shufan Li, Harkanwar Singh, and Aditya Grover. Mamba-nd: Selective state space modeling for multi-dimensional data. In European Conference on Computer Vision, pages 75–92. Springer, 2024

  30. [30]

    DAMamba: Vision state space model with dynamic adaptive scan

    Tanzhe Li, Caoshuo Li, Jiayi Lyu, Hongjuan Pei, Baochang Zhang, Taisong Jin, and Rongrong Ji. DAMamba: Vision state space model with dynamic adaptive scan. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  31. [31]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing

  32. [32]

    Pay attention to mlps

    Hanxiao Liu, Zihang Dai, David So, and Quoc V Le. Pay attention to mlps. Advances in neural information processing systems, 34:9204–9215, 2021

  33. [33]

    Defmamba: Deformable visual state space model

    Leiye Liu, Miao Zhang, Jihao Yin, Tingwei Liu, Wei Ji, Yongri Piao, and Huchuan Lu. Defmamba: Deformable visual state space model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8838–8847, 2025

  34. [34]

    More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity

    Shiwei Liu, Tianlong Chen, Xiaohan Chen, Xuxi Chen, Qiao Xiao, Boqian Wu, Tommi Kärkkäinen, Mykola Pechenizkiy, Decebal Constantin Mocanu, and Zhangyang Wang. More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity. In The Eleventh International Conference on Learning Representations, 2023

  35. [35]

    Vmamba: Visual state space model

    Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Jianbin Jiao, and Yunfan Liu. Vmamba: Visual state space model. Advances in neural information processing systems, 37:103031–103063, 2024

  36. [36]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021

  37. [37]

    A convnet for the 2020s

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986, 2022

  38. [38]

    MMDetection

    MMDetection Contributors. MMDetection. https://github.com/open-mmlab/mmdetection

  39. [39]

    MMSegmentation

    MMSegmentation Contributors. MMSegmentation. https://github.com/open-mmlab/mmsegmentation

  40. [40]

    Vcmamba: Bridging convolutions with multi-directional mamba for efficient visual representation

    Mustafa Munir, Alex Zhang, and Radu Marculescu. Vcmamba: Bridging convolutions with multi-directional mamba for efficient visual representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pages 3037–3046, 2025

  41. [41]

    Efficientvmamba: Atrous selective scan for light weight visual mamba

    Xiaohuan Pei, Tao Huang, and Chang Xu. Efficientvmamba: Atrous selective scan for light weight visual mamba. In Proceedings of the AAAI conference on artificial intelligence, pages 6443–6451, 2025

  42. [42]

    Hyena hierarchy: Towards larger convolutional language models

    Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. Hyena hierarchy: Towards larger convolutional language models. In International Conference on Machine Learning, pages 28043–28078. PMLR, 2023

  43. [43]

    Designing network design spaces

    Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10428–10436, 2020

  44. [44]

    Group- mamba: Efficient group-based visual state space model

    Abdelrahman Shaker, Syed Talal Wasim, Salman Khan, Juergen Gall, and Fahad Shahbaz Khan. Group- mamba: Efficient group-based visual state space model. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14912–14922, 2025

  45. [45]

    Multi-scale vmamba: Hierarchy in hierarchy visual state space model.Advances in Neural Information Processing Systems, 37:25687–25708, 2024

    Yuheng Shi, Minjing Dong, and Chang Xu. Multi-scale vmamba: Hierarchy in hierarchy visual state space model.Advances in Neural Information Processing Systems, 37:25687–25708, 2024

  46. [46]

    Vssd: Vision mamba with non-causal state space duality

    Yuheng Shi, Mingjia Li, Minjing Dong, and Chang Xu. Vssd: Vision mamba with non-causal state space duality. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10819–10829, 2025. 11

  47. [47]

    Very deep convolutional networks for large-scale image recogni- tion

    Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recogni- tion. In3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. Association for Computing Machinery, 2015

  48. [48]

    Smith, Andrew Warrington, and Scott Linderman

    Jimmy T.H. Smith, Andrew Warrington, and Scott Linderman. Simplified state space layers for sequence modeling. InThe Eleventh International Conference on Learning Representations, 2023

  49. [49]

    Springer Science & Business Media, 2013

    Eduardo D Sontag.Mathematical control theory: deterministic finite dimensional systems. Springer Science & Business Media, 2013

  50. [50]

    Going deeper with convolutions

    Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015

  51. [51]

    Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019.

  52. [52]

    Lv Tang, HaoKe Xiao, Peng-Tao Jiang, Hao Zhang, Jinwei Chen, and Bo Li. Scalable visual state space model with fractal scanning. arXiv preprint arXiv:2405.14480, 2024.

  53. [53]

    Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. Mlp-mixer: An all-mlp architecture for vision. Advances in neural information processing systems, 34:24261–24272, 2021.

  54. [54]

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021.

  55. [55]

    Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, et al. Resmlp: Feedforward networks for image classification with data-efficient training. IEEE transactions on pattern analysis and machine intelligence, 45(4):5314–5321, 2022.

  56. [56]

    Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, and Yinxiao Li. Maxvit: Multi-axis vision transformer. In European conference on computer vision, pages 459–479. Springer, 2022.

  57. [57]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

  58. [58]

    Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.

  59. [59]

    Ao Wang, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding. Repvit: Revisiting mobile cnn from vit perspective. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15909–15920, 2024.

  60. [60]

    Bo Wang, Mengyuan Xu, Yue Yan, Yuqun Yang, Kechen Shu, Wei Ping, Xu Tang, Wei Jiang, and Zheng You. Asm-unet: Adaptive scan mamba integrating group commonalities and individual variations for fine-grained segmentation, 2025.

  61. [61]

    Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF international conference on computer vision, pages 568–578, 2021.

  62. [62]

    Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pvt v2: Improved baselines with pyramid vision transformer. Computational visual media, 8(3):415–424, 2022.

  63. [63]

    Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, et al. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14408–14419, 2023.

  64. [64]

    Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018.

  65. [65]

    Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie. Convnext v2: Co-designing and scaling convnets with masked autoencoders. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16133–16142, 2023.

  66. [66]

    JiaFu Wu, Jian Li, Jiangning Zhang, Boshen Zhang, Mingmin Chi, Yabiao Wang, and Chengjie Wang. Pvg: Progressive vision graph for vision recognition. In Proceedings of the 31st ACM international conference on multimedia, pages 2477–2486, 2023.

  67. [67]

    Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, and Gao Huang. Vision transformer with deformable attention. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4794–4803, 2022.

  68. [68]

    Chaodong Xiao, Minghan Li, Zhengqiang Zhang, Deyu Meng, and Lei Zhang. Spatial-mamba: Effective visual state space models via structure-aware state fusion. In The Thirteenth International Conference on Learning Representations, 2025.

  69. [69]

    Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.

  70. [70]

    Fei Xie, Weijia Zhang, Zhongdao Wang, and Chao Ma. Quadmamba: Learning quadtree-based selective scan for visual state space model. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

  71. [71]

    Fei Xie, Zhongdao Wang, Weijia Zhang, and Chao Ma. Pvmamba: Parallelizing vision mamba via dynamic state aggregation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10218–10228, 2025.

  72. [72]

    Chenhongyi Yang, Zehui Chen, Miguel Espinosa, Linus Ericsson, Zhenyu Wang, Jiaming Liu, and Elliot J. Crowley. Plainmamba: Improving non-hierarchical mamba in visual recognition. In 35th British Machine Vision Conference 2024, BMVC 2024, Glasgow, UK, November 25-28, 2024. BMVA, 2024.

  73. [73]

    Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, and Jianfeng Gao. Focal self-attention for local-global interactions in vision transformers. arXiv preprint arXiv:2107.00641, 2021.

  74. [74]

    Weihao Yu, Pan Zhou, Shuicheng Yan, and Xinchao Wang. Inceptionnext: When inception meets convnext. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5672–5683, 2024.

  75. [75]

    Kai Zhang, Xia Yuan, and Chunxia Zhao. Objectness scan for efficient vision mamba. Knowledge-Based Systems, 330:114573, 2025.

  76. [76]

    Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5122–5130, 2017.

  77. [77]

    Lei Zhu, Xinjiang Wang, Zhanghan Ke, Wayne Zhang, and Rynson WH Lau. Biformer: Vision transformer with bi-level routing attention. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10323–10333, 2023.

  78. [78]

    Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision mamba: Efficient visual representation learning with bidirectional state space model. In Forty-first International Conference on Machine Learning, 2024.

A Analysis of GraphScan-Induced Selective Scan

This appendix provides a more detailed analysis of how GraphSca...