arxiv: 2503.18970 · v3 · submitted 2025-03-22 · 💻 cs.LG

Advancing Intelligent Sequence Modeling: Evolution, Trade-offs, and Applications of State-Space Architectures from S4 to Mamba

Pith reviewed 2026-05-22 22:03 UTC · model grok-4.3

classification 💻 cs.LG
keywords structured state space models · S4 · Mamba · sequence modeling · long-range dependencies · linear complexity · hybrid architectures · efficient transformers

The pith

Structured state space models achieve linear scaling and outperform transformers on long-range sequence tasks across domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This review maps the progression of structured state space models from the original S4 design through Mamba, S5, and hybrid forms like Jamba. It establishes that these architectures replace the quadratic memory costs and sequential bottlenecks of earlier networks with linear-time recurrence grounded in state-space dynamics. The work shows concrete gains in speed and memory for tasks that require modeling dependencies over thousands of steps. A reader cares because the shift promises practical handling of long inputs in language, audio, images, and forecasting without the resource explosion seen in attention-based systems.

Core claim

By combining structured recurrence with state-space representations, SSMs deliver linear or near-linear computational scaling while capturing long-range dependencies more effectively than RNNs or transformers, with documented advantages in inference speed and memory use demonstrated from S4 through Mamba across NLP, speech, vision, and time-series applications.

What carries the argument

The structured state space sequence (S4) model and its selective extensions such as Mamba, which replace dense attention with parameterized linear state transitions so that inference keeps a fixed-size state (constant memory per step) and total compute grows linearly with sequence length.
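
To make the "linear state transitions" concrete, here is a minimal sketch of a discrete linear SSM run as a recurrence. It assumes a diagonal state matrix and a single input channel, which is a simplification chosen for illustration and not the exact parameterization of S4 or Mamba. The function name `ssm_recurrent` and the parameters `a`, `b`, `c` are hypothetical.

```python
from typing import List, Sequence


def ssm_recurrent(a: Sequence[float], b: Sequence[float],
                  c: Sequence[float], u: Sequence[float]) -> List[float]:
    """Hypothetical helper: run x_k = a * x_{k-1} + b * u_k, y_k = sum(c * x_k).

    `a`, `b`, `c` hold one entry per state dimension (diagonal state matrix,
    an assumption made for this sketch).
    """
    state = [0.0] * len(a)  # fixed-size state: does not grow with len(u)
    outputs = []
    for u_k in u:  # single pass over the input: O(len(u) * len(a)) time
        state = [a_i * x_i + b_i * u_k for a_i, x_i, b_i in zip(a, state, b)]
        outputs.append(sum(c_i * x_i for c_i, x_i in zip(c, state)))
    return outputs


if __name__ == "__main__":
    # An impulse input decays geometrically through a one-dimensional state.
    print(ssm_recurrent([0.5], [1.0], [1.0], [1.0, 0.0, 0.0]))
```

The only quantity carried between steps is `state`, so the working memory depends on the state size and not on how many inputs have been seen.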

If this is right

  • Ultra-long sequences become feasible without quadratic memory growth, enabling direct modeling of entire documents or genomes.
  • Inference latency drops measurably, with the abstract reporting reductions of up to 60 percent in case studies such as real-time speech synthesis and genomic sequence modeling.
  • Hybrid SSM-transformer designs allow domain-specific tuning while preserving linear scaling in the dominant layers.
  • Resource-constrained settings gain access to state-of-the-art sequence performance previously limited to large GPU clusters.
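
The memory argument behind the first bullet can be checked with back-of-the-envelope arithmetic. The sketch below compares a fully materialized attention score matrix with a fixed SSM state. The function names, the 4-byte value size, and the example dimensions are all assumptions for illustration, and the attention figure ignores memory-efficient kernels that avoid storing the full matrix.

```python
def attention_score_bytes(seq_len: int, num_heads: int = 1,
                          bytes_per_value: int = 4) -> int:
    """Bytes for one materialized seq_len x seq_len score matrix per head."""
    return seq_len * seq_len * num_heads * bytes_per_value


def ssm_state_bytes(state_size: int, channels: int,
                    bytes_per_value: int = 4) -> int:
    """Bytes for a recurrent state of state_size values per channel."""
    return state_size * channels * bytes_per_value


if __name__ == "__main__":
    for seq_len in (1_000, 10_000, 100_000):
        # The attention term grows 100x for every 10x in length;
        # the SSM state term does not depend on seq_len at all.
        print(seq_len, attention_score_bytes(seq_len), ssm_state_bytes(16, 64))
```

This counts only the dominant term on each side. Activations, parameters, and key-value caches are left out, so the numbers indicate growth rates and not total footprints.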

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Continued hardware-aware redesign of the state transition matrices could widen the efficiency gap further on edge devices.
  • Interpretability tools developed for SSMs may transfer to other linear-time recurrent architectures.
  • If training instability issues are resolved, SSMs could replace transformers as the backbone for very long context windows in production systems.

Load-bearing premise

Performance advantages reported for SSMs in the surveyed literature represent their typical behavior rather than results selected from favorable tasks or implementations.

What would settle it

A single large-scale, domain-balanced benchmark in which transformer variants match or beat SSM variants on both accuracy and wall-clock efficiency for sequences longer than 10,000 tokens.

Figures

Figures reproduced from arXiv: 2503.18970 by Amir Rafe, Anandi Dutta, Gaurab Chhetri, Mahmuda Sultana Mimi, Md Monzurul Islam, Sazzad Bin Bashar Polock, Shriyank Somvanshi, Subasish Das.

Figure 1: Conceptual Representation of State Space Models, adapted from [2]
Figure 3: HiPPO Framework for Online Function Approximation, adapted from [35]
Figure 4: S4 Layer Structure, adapted from [10]
    From Section 3.2.2, Efficient Convolutions Replacing RNN-style Recurrence: S4 replaces traditional RNN-style recurrence with convolutional operations, allowing it to handle long sequences much more efficiently [2]. While classical recurrent architectures update states sequentially, S4 instead applies a convolutional filter over the input sequence, which effectively replaces…
Figure 5: Architecture of Graph-Mamba, adapted from [56]
Figure 6: Architecture of MambaTS, adapted from [60]
Figure 7: Computational Components of the S5 Layer With Parallel Scan on a Diagonalized Linear SSM for …
Figure 8: Overview of the FusionMamba framework, adapted from [76]
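
The Figure 4 excerpt says S4 swaps step-by-step recurrence for a convolution over the input. A minimal sketch of why the two views agree for a time-invariant linear SSM follows. It assumes a diagonal state matrix and uses a direct (quadratic-time) convolution for readability. S4 itself computes the convolution with FFTs, which this sketch does not attempt. All function names are hypothetical.

```python
from typing import List, Sequence


def ssm_kernel(a: Sequence[float], b: Sequence[float],
               c: Sequence[float], length: int) -> List[float]:
    """Kernel K_j = sum_i c_i * a_i**j * b_i for j = 0 .. length-1."""
    return [sum(c_i * (a_i ** j) * b_i for a_i, b_i, c_i in zip(a, b, c))
            for j in range(length)]


def causal_conv(kernel: Sequence[float], u: Sequence[float]) -> List[float]:
    """y_k = sum_{j<=k} kernel[j] * u[k-j]; direct form, O(L^2) for clarity."""
    return [sum(kernel[j] * u[k - j] for j in range(k + 1))
            for k in range(len(u))]


def recurrent_outputs(a: Sequence[float], b: Sequence[float],
                      c: Sequence[float], u: Sequence[float]) -> List[float]:
    """Reference recurrence: x_k = a * x_{k-1} + b * u_k, y_k = sum(c * x_k)."""
    state = [0.0] * len(a)
    outputs = []
    for u_k in u:
        state = [a_i * x_i + b_i * u_k for a_i, x_i, b_i in zip(a, state, b)]
        outputs.append(sum(c_i * x_i for c_i, x_i in zip(c, state)))
    return outputs


if __name__ == "__main__":
    a, b, c = [0.5, 0.9], [1.0, 1.0], [1.0, -1.0]
    u = [1.0, 2.0, 3.0, 4.0]
    print(causal_conv(ssm_kernel(a, b, c, len(u)), u))
    print(recurrent_outputs(a, b, c, u))  # same values up to rounding
```

Because the kernel depends only on the model parameters, it can be built once and applied to the whole sequence at training time, while the recurrent form remains available for step-by-step inference.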
Abstract

Structured State Space Models (SSMs) have emerged as a transformative paradigm in sequence modeling, addressing critical limitations of Recurrent Neural Networks (RNNs) and Transformers, namely, vanishing gradients, sequential computation bottlenecks, and quadratic memory complexity. By integrating structured recurrence with state-space representations, SSMs achieve linear or near-linear computational scaling while excelling in long-range dependency tasks. This study systematically traces the evolution of SSMs from the foundational Structured State Space Sequence (S4) model to modern variants like Mamba, Simplified Structured State Space Sequence (S5), and Jamba, analyzing architectural innovations that enhance computational efficiency, memory optimization, and inference speed. We critically evaluate trade-offs inherent to SSM design, such as balancing expressiveness with computational constraints and integrating hybrid architectures for domain-specific performance. Across domains including natural language processing, speech recognition, computer vision, and time-series forecasting, SSMs demonstrate state-of-the-art results in handling ultra-long sequences, outperforming Transformer-based models in both speed and memory utilization. Case studies highlight applications such as real-time speech synthesis and genomic sequence modeling, where SSMs reduce inference latency by up to 60% compared to traditional approaches. However, challenges persist in training dynamics, interpretability, and hardware-aware optimization. We conclude with a forward-looking analysis of SSMs' potential to redefine scalable deep learning, proposing directions for hybrid systems, theoretical guarantees, and broader adoption in resource-constrained environments.
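
The abstract names sequential computation as a bottleneck, and the Figure 7 caption mentions a parallel scan on a diagonalized linear SSM. The sketch below illustrates the idea for a single diagonal entry: the recurrence x_k = a_k * x_{k-1} + b_k can be expressed with an associative operator, so prefixes can be combined in logarithmically many rounds. It simulates the rounds sequentially in plain Python, uses a simple Hillis-Steele style scan (work-efficient variants exist), and is not the S5 implementation. All names are hypothetical.

```python
from typing import List, Sequence, Tuple

Elem = Tuple[float, float]  # (multiplier a_k, offset b_k)


def combine(left: Elem, right: Elem) -> Elem:
    """Compose two affine steps x -> a*x + b, applying `left` first."""
    a_l, b_l = left
    a_r, b_r = right
    return (a_l * a_r, a_r * b_l + b_r)


def inclusive_scan(elems: Sequence[Elem]) -> List[Elem]:
    """Prefix-combine all elements in about log2(len) rounds."""
    result = list(elems)
    step = 1
    while step < len(result):
        # Within a round every position is independent of the others,
        # so parallel hardware could update all positions at once.
        result = [combine(result[i - step], result[i]) if i >= step
                  else result[i] for i in range(len(result))]
        step *= 2
    return result


def scan_states(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """States x_k of x_k = a_k * x_{k-1} + b_k with x_0 = 0 before the first step."""
    return [offset for _, offset in inclusive_scan(list(zip(a, b)))]


if __name__ == "__main__":
    print(scan_states([0.5, 0.5, 0.5, 0.5], [1.0, 2.0, 3.0, 4.0]))
```

The order of arguments in `combine` matters because the operator is associative but not commutative, which is why the scan always places the earlier prefix on the left.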

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. This manuscript is a survey tracing the evolution of Structured State Space Models (SSMs) from the S4 model through variants including S5, Mamba, and Jamba. It describes how these architectures integrate structured recurrence and state-space representations to achieve linear or near-linear scaling, address vanishing gradients and quadratic complexity issues of prior models, and improve long-range dependency modeling. The paper reviews architectural innovations for efficiency and memory, evaluates design trade-offs, surveys applications and reported results across NLP, speech, vision, and time-series domains, presents case studies (including latency reductions), and outlines challenges and future directions for hybrid systems and theoretical work.

Significance. If the synthesis of cited results is accurate and balanced, the survey could provide a useful consolidated reference for researchers tracking the shift toward SSMs for long-sequence tasks. Its value would lie in organizing the progression of ideas and highlighting efficiency claims from the literature rather than introducing new empirical findings.

major comments (1)
  1. [Abstract] The quantitative claim that SSMs 'reduce inference latency by up to 60% compared to traditional approaches' in case studies on speech synthesis and genomic modeling is presented without any citation to the specific studies or manuscript sections containing those results. In a survey, such assertions require traceable references to allow verification of representativeness.
minor comments (1)
  1. The abstract states that the paper 'critically evaluate[s] trade-offs' yet the provided text frames performance advantages primarily as descriptions of prior work; the main text should clarify the extent of original critical analysis versus summarization.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and recommendation of minor revision. We address the single major comment below.

Point-by-point responses
  1. Referee: [Abstract] The quantitative claim that SSMs 'reduce inference latency by up to 60% compared to traditional approaches' in case studies on speech synthesis and genomic modeling is presented without any citation to the specific studies or manuscript sections containing those results. In a survey, such assertions require traceable references to allow verification of representativeness.

    Authors: We agree that the abstract's quantitative claim requires explicit traceability. The latency reduction figure is supported by the case studies and cited results discussed later in the manuscript (applications sections on speech and genomics). To ensure verifiability as a survey, we will revise the abstract to include direct citations to the relevant studies. This will be incorporated in the revised version.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

This is a survey paper that traces the evolution of SSM architectures from S4 to Mamba by summarizing and citing prior literature on performance, scaling, and applications. No new derivations, equations, predictions, or fitted parameters are introduced that could reduce to self-referential constructions. Central claims about linear scaling and outperformance are presented as descriptions of existing results rather than internally derived quantities. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results occur in a way that makes the survey's content equivalent to its inputs by definition. The paper remains self-contained as a descriptive review without load-bearing circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey paper, no new free parameters, axioms, or invented entities are introduced by the authors.

pith-pipeline@v0.9.0 · 5836 in / 1056 out tokens · 30098 ms · 2026-05-22T22:03:56.984306+00:00 · methodology

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Robust Filter Attention: Self-Attention as Precision-Weighted State Estimation

    cs.LG 2025-09 unverdicted novelty 7.0

    Robust Filter Attention models self-attention as consistency-based state estimation under a linear SDE for token trajectories, matching standard attention complexity while showing lower perplexity and better zero-shot...

  2. RT-Transformer: The Transformer Block as a Spherical State Estimator

    cs.LG 2026-05 unverdicted novelty 6.0

    Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.

  3. HST-HGN: Heterogeneous Spatial-Temporal Hypergraph Networks with Bidirectional State Space Models for Global Fatigue Assessment

    cs.CV 2026-04 unverdicted novelty 5.0

    HST-HGN uses heterogeneous spatial-temporal hypergraph networks combined with bidirectional Mamba state space models to achieve state-of-the-art driver fatigue assessment from untrimmed videos while maintaining comput...

  4. Deep Learning for Virtual Reality User Identification: A Benchmark

    cs.HC 2026-03 unverdicted novelty 4.0

    A benchmark study evaluates standard and emerging deep learning architectures on motion data from 71 VR users, establishing performance baselines for user identification.

  5. Tabular Data with Class Imbalance: Predicting Electric Vehicle Crash Severity with Pretrained Transformers (TabPFN) and Mamba-Based Models

    cs.LG 2025-09 unverdicted novelty 4.0

    Benchmarks TabPFN, MambaNet and MambaAttention on imbalanced EV crash severity classification with SMOTEENN resampling on Texas data, identifying intersection relation and speed limit as top features and MambaAttentio...

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · cited by 5 Pith papers · 13 internal anchors

  1. [1]

    Mamba-360: Survey of state space models as transformer alternative for long sequence modelling: Methods, applications, and challenges

    Badri Narayana Patro and Vijay Srinivas Agneeswaran. Mamba-360: Survey of state space models as transformer alternative for long sequence modelling: Methods, applications, and challenges. arXiv preprint arXiv:2404.16112, 2024

  2. [2]

    Efficiently Modeling Long Sequences with Structured State Spaces

    Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021

  3. [3]

    Stablessm: Alleviating the curse of memory in state-space models through stable reparameterization

    Shida Wang and Qianxiao Li. Stablessm: Alleviating the curse of memory in state-space models through stable reparameterization. arXiv preprint arXiv:2311.14495, 2023

  4. [4]

    Efficient long sequence modeling via state space augmented transformer

    Simiao Zuo, Xiaodong Liu, Jian Jiao, Denis Charles, Eren Manavoglu, Tuo Zhao, and Jianfeng Gao. Efficient long sequence modeling via state space augmented transformer. arXiv preprint arXiv:2212.08136, 2022

  5. [5]

    Applying tabular deep learning models to estimate crash injury types of young motorcyclists

    Shriyank Somvanshi, Anannya Ghosh Tusti, Subasish Das, and Rohit Chakraborty. Applying tabular deep learning models to estimate crash injury types of young motorcyclists. In IEEE CAI, Santa Clara, California, USA, May 5-7 2025

  6. [6]

    Crash severity analysis of child bicyclists using arm-net and mambanet

    Shriyank Somvanshi, Rohit Chakraborty, Anandi K Dutta, and Subasish Das. Crash severity analysis of child bicyclists using arm-net and mambanet. In IEEE CAI, Santa Clara, California, USA, May 5-7 2025

  7. [7]

    Mathematical formalism for memory compression in selective state space models

    Siddhanth Bhat. Mathematical formalism for memory compression in selective state space models. arXiv preprint arXiv:2410.03158, 2024

  8. [8]

    Theoretical foundations of deep selective state-space models

    Nicola Muca Cirone, Antonio Orvieto, Benjamin Walker, Cristopher Salvi, and Terry Lyons. Theoretical foundations of deep selective state-space models. Advances in Neural Information Processing Systems, 37:127226–127272, 2024

  9. [9]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060, 2024

  10. [10]

    Simplified State Space Layers for Sequence Modeling

    Jimmy TH Smith, Andrew Warrington, and Scott W Linderman. Simplified state space layers for sequence modeling. arXiv preprint arXiv:2208.04933, 2022

  11. [11]

    Exploring the capability of mamba in speech applications

    Koichi Miyazaki, Yoshiki Masuyama, and Masato Murata. Exploring the capability of mamba in speech applications. arXiv preprint arXiv:2406.16808, 2024

  12. [12]

    A comprehensive survey of mamba architectures for medical image analysis: Classification, segmentation, restoration and beyond

    Shubhi Bansal, Sreekanth Madisetty, Mohammad Zia Ur Rehman, Chandravardhan Singh Raghaw, Gaurav Duggal, Nagendra Kumar, et al. A comprehensive survey of mamba architectures for medical image analysis: Classification, segmentation, restoration and beyond. arXiv preprint arXiv:2410.02362, 2024

  13. [13]

    Jamba-1.5: Hybrid transformer-mamba models at scale

    Jamba Team, Barak Lenz, Alan Arazi, Amir Bergman, Avshalom Manevich, Barak Peleg, Ben Aviram, Chen Almagor, Clara Fridman, Dan Padnos, et al. Jamba-1.5: Hybrid transformer-mamba models at scale. arXiv preprint arXiv:2408.12570, 2024

  14. [14]

    State space model for new-generation network alternative to transformers: A survey

    Xiao Wang, Shiao Wang, Yuhe Ding, Yuehang Li, Wentao Wu, Yao Rong, Weizhe Kong, Ju Huang, Shihao Li, Haoxiang Yang, et al. State space model for new-generation network alternative to transformers: A survey. arXiv preprint arXiv:2404.09516, 2024. J. ACM, Vol. , No. , Article . Publication date: March 2025. 28 Somvanshi et al

  15. [15]

    Analysis and control of nonlinear process systems

    Katalin M Hangos, József Bokor, and Gábor Szederkényi. Analysis and control of nonlinear process systems. Springer Science & Business Media, 2006

  16. [16]

    A new approach to linear filtering and prediction problems

    Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering, 82(1):35–45, 1960

  17. [17]

    Time series analysis

    James D Hamilton. Time series analysis. Princeton university press, 2020

  18. [18]

    Linear systems, volume 156

    Thomas Kailath. Linear systems, volume 156. Prentice-Hall Englewood Cliffs, NJ, 1980

  19. [19]

    Combining recurrent, convolutional, and continuous-time models with linear state space layers

    Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in neural information processing systems, 34:572–585, 2021

  20. [20]

    Finding structure in time

    Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990

  21. [21]

    Serial order: A parallel distributed processing approach

    Michael I Jordan. Serial order: A parallel distributed processing approach. In Advances in psychology, volume 121, pages 471–495. Elsevier, 1997

  22. [22]

    Long short-term memory

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997

  23. [23]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

  24. [24]

    Backpropagation applied to handwritten zip code recognition

    Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989

  25. [25]

    Gradient-based learning applied to document recognition

    Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998

  26. [26]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  27. [27]

    A comparison of lstm and gru networks for learning symbolic sequences

    Roberto Cahuantzi, Xinye Chen, and Stefan Güttel. A comparison of lstm and gru networks for learning symbolic sequences. In Science and Information Conference, pages 771–785. Springer, 2023

  28. [28]

    Using cnn for facial expression recognition: a study of the effects of kernel size and number of filters on accuracy

    Abhinav Agrawal and Namita Mittal. Using cnn for facial expression recognition: a study of the effects of kernel size and number of filters on accuracy. The Visual Computer, 36(2):405–412, 2020

  29. [29]

    Theoretical foundations of deep selective state-space models

    Nicola Muca Cirone, Antonio Orvieto, Benjamin Walker, Cristopher Salvi, and Terry Lyons. Theoretical foundations of deep selective state-space models. arXiv preprint arXiv:2402.19047, 2024

  30. [30]

    Dygmamba: Efficiently modeling long-term temporal dependency on continuous-time dynamic graphs with state space models

    Zifeng Ding, Yifeng Li, Yuan He, Antonio Norelli, Jingcheng Wu, Volker Tresp, Yunpu Ma, and Michael Bronstein. Dygmamba: Efficiently modeling long-term temporal dependency on continuous-time dynamic graphs with state space models. arXiv preprint arXiv:2408.04713, 2024

  31. [31]

    Learning long-term dependencies with gradient descent is difficult

    Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994

  32. [32]

    How to train your hippo: State space models with generalized orthogonal basis projections

    Albert Gu, Isys Johnson, Aman Timalsina, Atri Rudra, and Christopher Ré. How to train your hippo: State space models with generalized orthogonal basis projections. arXiv preprint arXiv:2206.12037, 2022

  33. [33]

    Theory for the user

    Lennart Ljung et al. Theory for the user. System identification, 1987

  34. [34]

    Neural ordinary differential equations

    Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. Advances in neural information processing systems, 31, 2018

  35. [35]

    Hippo: Recurrent memory with optimal polynomial projections

    Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. Hippo: Recurrent memory with optimal polynomial projections. Advances in neural information processing systems, 33:1474–1487, 2020

  36. [36]

    Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

    Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014

  37. [37]

    Transformers are rnns: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020

  38. [38]

    Efficient transformers: A survey

    Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. ACM Computing Surveys, 55(6):1–28, 2022

  39. [39]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  40. [40]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1728–1738, 2021

  41. [41]

    Temporal fusion transformers for interpretable multi-horizon time series forecasting

    Bryan Lim, Sercan Ö Arık, Nicolas Loeff, and Tomas Pfister. Temporal fusion transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting, 37(4):1748–1764, 2021

  42. [42]

    Speech recognition with deep recurrent neural networks

    Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pages 6645–6649. IEEE, 2013

  43. [43]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020. J. ACM, Vol. , No. , Article . Publication date: March 2025. From S4 to Mamba: A Comprehensi...

  44. [44]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

  45. [45]

    Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

    Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019

  46. [46]

    Linformer: Self-Attention with Linear Complexity

    Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020

  47. [47]

    Modeling multivariate biosignals with graph neural networks and structured state space models

    Siyi Tang, Jared A Dunnmon, Qu Liangqiong, Khaled K Saab, Tina Baykaner, Christopher Lee-Messer, and Daniel L Rubin. Modeling multivariate biosignals with graph neural networks and structured state space models. In Conference on health, inference, and learning, pages 50–71. PMLR, 2023

  48. [48]

    Spectral Normalization for Generative Adversarial Networks

    Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018

  49. [49]

    Convolutional state space models for long-range spatiotemporal modeling

    Jimmy Smith, Shalini De Mello, Jan Kautz, Scott Linderman, and Wonmin Byeon. Convolutional state space models for long-range spatiotemporal modeling. Advances in Neural Information Processing Systems, 36:80690–80729, 2023

  50. [50]

    Highly accurate protein structure prediction with alphafold

    John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021

  51. [51]

    Chain of agents: Large language models collaborating on long-context tasks

    Yusen Zhang, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, and Sercan Arik. Chain of agents: Large language models collaborating on long-context tasks. Advances in Neural Information Processing Systems, 37:132208–132237, 2024

  52. [52]

    Gpipe: Efficient training of giant neural networks using pipeline parallelism

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019

  53. [53]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492–28518. PMLR, 2023

  54. [54]

    Long range arena: A benchmark for efficient transformers

    Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long range arena: A benchmark for efficient transformers. arXiv preprint arXiv:2011.04006, 2020

  55. [55]

    Spectral normalisation for deep reinforcement learning: an optimisation perspective

    Florin Gogianu, Tudor Berariu, Mihaela C Rosca, Claudia Clopath, Lucian Busoniu, and Razvan Pascanu. Spectral normalisation for deep reinforcement learning: an optimisation perspective. In International Conference on Machine Learning, pages 3734–3744. PMLR, 2021

  56. [56]

    Graph-mamba: Towards long-range graph sequence modeling with selective state spaces

    Chloe Wang, Oleksii Tsepa, Jun Ma, and Bo Wang. Graph-mamba: Towards long-range graph sequence modeling with selective state spaces. arXiv preprint arXiv:2402.00789, 2024

  57. [57]

    Computation-efficient era: A comprehensive survey of state space models in medical image analysis

    Moein Heidari, Sina Ghorbani Kolahi, Sanaz Karimijafarbigloo, Bobby Azad, Afshin Bozorgpour, Soheila Hatami, Reza Azad, Ali Diba, Ulas Bagci, Dorit Merhof, et al. Computation-efficient era: A comprehensive survey of state space models in medical image analysis. arXiv preprint arXiv:2406.03430, 2024

  58. [58]

    A survey on visual mamba

    Hanwei Zhang, Ying Zhu, Dan Wang, Lijun Zhang, Tianxiang Chen, Ziyang Wang, and Zi Ye. A survey on visual mamba. Applied Sciences, 14(13):5683, 2024

  59. [59]

    The hidden attention of mamba models

    Ameen Ali, Itamar Zimerman, and Lior Wolf. The hidden attention of mamba models. arXiv preprint arXiv:2403.01590, 2024

  60. [60]

    Mambats: improved selective state space models for long-term time series forecasting

    Xiuding Cai, Yaoyao Zhu, Xueyao Wang, and Yu Yao. Mambats: improved selective state space models for long-term time series forecasting. arXiv preprint arXiv:2405.16440, 2024

  61. [61]

    Coupled mamba: Enhanced multi-modal fusion with coupled state space model

    Wenbing Li, Hang Zhou, Junqing Yu, Zikai Song, and Wei Yang. Coupled mamba: Enhanced multi-modal fusion with coupled state space model. arXiv preprint arXiv:2405.18014, 2024

  62. [62]

    Vl-mamba: Exploring state space models for multimodal learning

    Yanyuan Qiao, Zheng Yu, Longteng Guo, Sihan Chen, Zijia Zhao, Mingzhen Sun, Qi Wu, and Jing Liu. Vl-mamba: Exploring state space models for multimodal learning. arXiv preprint arXiv:2403.13600, 2024

  63. [63]

    Kalmamba: Towards efficient probabilistic state space models for rl under uncertainty

    Philipp Becker, Niklas Freymuth, and Gerhard Neumann. Kalmamba: Towards efficient probabilistic state space models for rl under uncertainty. arXiv preprint arXiv:2406.15131, 2024

  64. [64]

    Simba: Simplified mamba-based architecture for vision and multivariate time series

    Badri N Patro and Vijay S Agneeswaran. Simba: Simplified mamba-based architecture for vision and multivariate time series. arXiv preprint arXiv:2403.15360, 2024

  65. [65]

    Jamba: A Hybrid Transformer-Mamba Language Model

    Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-mamba language model. arXiv preprint arXiv:2403.19887, 2024

  66. [66]

    Multilingual state space models for structured question answering in indic languages

    Arpita Vats, Rahul Raja, Mrinal Mathur, Vinija Jain, and Aman Chadha. Multilingual state space models for structured question answering in indic languages. arXiv preprint arXiv:2502.01673, 2025

  67. [67]

    Zamba: A compact 7B SSM hybrid model

    Paolo Glorioso, Minghan He, Yehonatan Rozen, Alex Kuefler, Omer Lieber, Brendan Millidge, Peter Battaglia, Aran Komatsuzaki, Aäron van den Oord, Alex Graves, et al. Zamba: A compact 7b ssm hybrid model. arXiv preprint arXiv:2405.16712, 2024

  68. [68]

    S4nd: Modeling images and videos as multidimensional signals with state spaces

    Eric Nguyen, Karan Goel, Albert Gu, Gordon Downs, Preey Shah, Tri Dao, Stephen Baccus, and Christopher Ré. S4nd: Modeling images and videos as multidimensional signals with state spaces. Advances in neural information processing systems, 35:2846–2861, 2022

  69. [69]

    Diagonal state spaces are as effective as structured state spaces

    Ankit Gupta, Albert Gu, and Jonathan Berant. Diagonal state spaces are as effective as structured state spaces. Advances in Neural Information Processing Systems, 35:22982–22994, 2022

  70. [70]

    Liquid structural state-space models

    Ramin Hasani, Mathias Lechner, Tsun-Hsuan Wang, Makram Chahine, Alexander Amini, and Daniela Rus. Liquid structural state-space models. arXiv preprint arXiv:2209.12951, 2022

  71. [71]

    Hyena hierarchy: Towards larger convolutional language models

    Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. Hyena hierarchy: Towards larger convolutional language models. In International Conference on Machine Learning, pages 28043–28078. PMLR, 2023

  72. [72]

    On the parameterization and initialization of diagonal state space models

    Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré. On the parameterization and initialization of diagonal state space models. Advances in Neural Information Processing Systems, 35:35971–35983, 2022

  73. [73]

    Mega: Moving average equipped gated attention

    Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer. Mega: Moving average equipped gated attention. arXiv preprint arXiv:2209.10655, 2022

  74. [74]

    RWKV: Reinventing RNNs for the Transformer Era

    Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, et al. Rwkv: Reinventing rnns for the transformer era. arXiv preprint arXiv:2305.13048, 2023

  75. [75]

    U-mamba: Enhancing long-range dependency for biomedical image segmentation

    J. Ma, F. Li, and B. Wang. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722, 2024

  76. [76]

    Fusionmamba: Dynamic feature enhancement for multimodal image fusion with mamba

    Xinyu Xie, Yawen Cui, Tao Tan, Xubin Zheng, and Zitong Yu. Fusionmamba: Dynamic feature enhancement for multimodal image fusion with mamba. Visual Intelligence, 2(1):37, 2024

  77. [77]

    A survey of rwkv

    Zhiyuan Li, Tingyu Xia, Yi Chang, and Yuan Wu. A survey of rwkv. arXiv preprint arXiv:2412.14847, 2024

  78. [78]

    Linear recurrent units for sequential recommendation

    Zhenrui Yue, Yueqi Wang, Zhankui He, Huimin Zeng, Julian McAuley, and Dong Wang. Linear recurrent units for sequential recommendation. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, pages 930–938, 2024

  79. [79]

    Soft hierarchical graph recurrent networks for many-agent partially observable environments

    Zhenhui Ye, Xiaohong Jiang, Guanghua Song, and Bowei Yang. Soft hierarchical graph recurrent networks for many-agent partially observable environments. arXiv preprint arXiv:2109.02032, 2021

  80. [80]

    Soft-hgrns: soft hierarchical graph recurrent networks for multi-agent partially observable environments

    Yixiang Ren, Zhenhui Ye, Yining Chen, Xiaohong Jiang, and Guanghua Song. Soft-hgrns: soft hierarchical graph recurrent networks for multi-agent partially observable environments. Frontiers of Information Technology & Electronic Engineering, 24(1):117–130, 2023

Showing first 80 references.