Advancing Intelligent Sequence Modeling: Evolution, Trade-offs, and Applications of State-Space Architectures from S4 to Mamba

Amir Rafe; Anandi Dutta; Gaurab Chhetri; Mahmuda Sultana Mimi; Md Monzurul Islam; Sazzad Bin Bashar Polock; Shriyank Somvanshi; Subasish Das

arxiv: 2503.18970 · v3 · submitted 2025-03-22 · 💻 cs.LG

Shriyank Somvanshi, Md Monzurul Islam, Mahmuda Sultana Mimi, Sazzad Bin Bashar Polock, Gaurab Chhetri, Anandi Dutta, Amir Rafe, Subasish Das

Pith reviewed 2026-05-22 22:03 UTC · model grok-4.3

classification 💻 cs.LG

keywords structured state space models · S4 · Mamba · sequence modeling · long-range dependencies · linear complexity · hybrid architectures · efficient transformers

The pith

Structured state space models achieve linear scaling and outperform transformers on long-range sequence tasks across domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This review maps the progression of structured state space models from the original S4 design through Mamba, S5, and hybrid forms like Jamba. It establishes that these architectures replace the quadratic memory costs and sequential bottlenecks of earlier networks with linear-time recurrence grounded in state-space dynamics. The work shows concrete gains in speed and memory for tasks that require modeling dependencies over thousands of steps. A reader cares because the shift promises practical handling of long inputs in language, audio, images, and forecasting without the resource explosion seen in attention-based systems.

Core claim

By combining structured recurrence with state-space representations, SSMs deliver linear or near-linear computational scaling while capturing long-range dependencies more effectively than RNNs or transformers, with documented advantages in inference speed and memory use demonstrated from S4 through Mamba across NLP, speech, vision, and time-series applications.

What carries the argument

The structured state space sequence (S4) model and its selective extensions such as Mamba, which replace dense attention with parameterized linear state transitions to maintain constant memory and linear time complexity.
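
To make the phrase "parameterized linear state transitions" concrete, here is a minimal Python sketch of a discrete linear state-space recurrence, x_k = A x_{k-1} + B u_k and y_k = C x_k, with a diagonal A. It is an illustration under stated assumptions, not the method of any surveyed model: the name `ssm_stream`, the diagonal parameterization, and the toy coefficients are hypothetical, and the sketch leaves out S4's structured parameterization and Mamba's input-dependent (selective) parameters. It only shows why the memory footprint does not depend on sequence length and why the cost grows linearly with it.

```python
from typing import Iterable, Iterator, Sequence


def ssm_stream(
    a_diag: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
    inputs: Iterable[float],
) -> Iterator[float]:
    """Yield y_k for x_k = A x_{k-1} + B u_k, y_k = C x_k with diagonal A.

    Hypothetical toy layer: only the current state vector is stored, so
    memory stays O(N) in the state size N regardless of sequence length,
    and the work is O(N) per input step.
    """
    state = [0.0] * len(a_diag)
    for u in inputs:
        # Elementwise update because A is assumed to be diagonal.
        state = [a * x + bi * u for a, x, bi in zip(a_diag, state, b)]
        yield sum(ci * x for ci, x in zip(c, state))


# Toy parameters (assumed): two decaying modes with |a| < 1 for stability.
A_DIAG = [0.5, 0.9]
B = [1.0, 1.0]
C = [1.0, 0.5]

# Impulse input: the outputs trace the decaying response of the two modes.
outputs = list(ssm_stream(A_DIAG, B, C, [1.0, 0.0, 0.0]))
```

Because the function is a generator, it can consume an arbitrarily long input stream while holding only the current state, which is the property the paragraph above attributes to these models.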

If this is right

Ultra-long sequences become feasible without quadratic memory growth, enabling direct modeling of entire documents or genomes.
Inference latency drops measurably, with reported reductions reaching 60 percent in real-time speech synthesis.
Hybrid SSM-transformer designs allow domain-specific tuning while preserving linear scaling in the dominant layers.
Resource-constrained settings gain access to state-of-the-art sequence performance previously limited to large GPU clusters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Continued hardware-aware redesign of the state transition matrices could widen the efficiency gap further on edge devices.
Interpretability tools developed for SSMs may transfer to other linear-time recurrent architectures.
If training instability issues are resolved, SSMs could replace transformers as the backbone for very long context windows in production systems.

Load-bearing premise

Performance advantages reported for SSMs in the surveyed literature represent their typical behavior rather than results selected from favorable tasks or implementations.

What would settle it

A single large-scale, domain-balanced benchmark in which transformer variants match or beat SSM variants on both accuracy and wall-clock efficiency for sequences longer than 10,000 tokens.

Figures

Figures reproduced from arXiv: 2503.18970 by Amir Rafe, Anandi Dutta, Gaurab Chhetri, Mahmuda Sultana Mimi, Md Monzurul Islam, Sazzad Bin Bashar Polock, Shriyank Somvanshi, Subasish Das.

**Figure 3.** HiPPO Framework for Online Function Approximation, adapted from [35].

**Figure 4.** S4 Layer Structure, adapted from [10].

3.2.2 Efficient Convolutions Replacing RNN-style Recurrence. S4 replaces traditional RNN-style recurrence with convolutional operations, allowing it to handle long sequences in a much more efficient manner [2]. While classical recurrent architectures update states sequentially, S4 instead applies a convolutional filter over the input sequence, which effectively replaces…
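
The convolutional view described in the excerpt above can be sketched for a toy case. For a time-invariant linear SSM with zero initial state, unrolling the recurrence gives y_k = sum_j K_j u_{k-j} with kernel K_j = C A^j B, so a single causal convolution reproduces the step-by-step outputs. The scalar state, the function names, and the parameter values below are assumptions for illustration; the direct loop is used for readability, whereas S4 evaluates the convolution with FFTs.

```python
from typing import List, Sequence


def ssm_kernel(a: float, b: float, c: float, length: int) -> List[float]:
    """Kernel K[j] = c * a**j * b of a scalar-state linear SSM (toy case)."""
    return [c * (a ** j) * b for j in range(length)]


def causal_conv(kernel: Sequence[float], u: Sequence[float]) -> List[float]:
    """Direct causal convolution y[k] = sum_j K[j] * u[k - j].

    Written as a plain O(L^2) loop for clarity; an FFT-based version
    would compute the same values in O(L log L).
    """
    return [
        sum(kernel[j] * u[k - j] for j in range(k + 1))
        for k in range(len(u))
    ]


def recurrent(a: float, b: float, c: float, u: Sequence[float]) -> List[float]:
    """Step-by-step recurrence x_k = a * x_{k-1} + b * u_k, y_k = c * x_k."""
    x, ys = 0.0, []
    for u_k in u:
        x = a * x + b * u_k
        ys.append(c * x)
    return ys


# Assumed toy values; both views should agree up to rounding error.
u = [1.0, 2.0, -1.0, 0.5]
y_conv = causal_conv(ssm_kernel(0.8, 1.0, 2.0, len(u)), u)
y_rec = recurrent(0.8, 1.0, 2.0, u)
```

The kernel depends only on the model parameters, not on the input, which is why it can be computed once and applied to the whole sequence instead of stepping through it.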

**Figure 5.** Architecture of Graph-Mamba, adapted from [56].

**Figure 6.** Architecture of MambaTS, adapted from [60].

**Figure 7.** Computational Components of the S5 Layer With Parallel Scan on a Diagonalized Linear SSM for…
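
The caption above mentions a parallel scan over a diagonalized linear SSM. A minimal sketch of the idea, under assumptions: with a diagonal state matrix, each state coordinate follows x_k = lam * x_{k-1} + b_k, and each step is an affine map whose composition is associative, so all states can be obtained with a prefix scan in roughly log2(L) parallel rounds. The Hillis-Steele scheme, the function names, and the real-valued scalar `lam` below are illustrative choices, not the S5 implementation.

```python
from typing import List, Tuple

Elem = Tuple[float, float]  # (a, b) encodes the affine map x -> a * x + b


def combine(first: Elem, second: Elem) -> Elem:
    """Compose two affine maps, applying `first` and then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)


def inclusive_scan(elems: List[Elem]) -> List[Elem]:
    """Hillis-Steele scan: about log2(L) rounds, each round parallelizable.

    The rounds run sequentially here; on parallel hardware every position
    within a round could be updated at the same time.
    """
    out = list(elems)
    step = 1
    while step < len(out):
        # Each new entry reads only values from the previous round.
        out = [
            combine(out[i - step], out[i]) if i >= step else out[i]
            for i in range(len(out))
        ]
        step *= 2
    return out


def scan_states(lam: float, bu: List[float]) -> List[float]:
    """States x_k = lam * x_{k-1} + bu[k], starting from a zero state."""
    return [b for _, b in inclusive_scan([(lam, v) for v in bu])]


def sequential_states(lam: float, bu: List[float]) -> List[float]:
    """Reference implementation: the same recurrence, one step at a time."""
    x, xs = 0.0, []
    for v in bu:
        x = lam * x + v
        xs.append(x)
    return xs


# Assumed toy values; the scan should match the sequential recurrence.
bu = [1.0, 0.5, -2.0, 3.0, 0.25]
xs_scan = scan_states(0.9, bu)
xs_seq = sequential_states(0.9, bu)
```

The scan does more total arithmetic than the sequential loop, but its depth is logarithmic in the sequence length, which is the trade that makes it attractive on parallel hardware.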

**Figure 8.** Overview of the FusionMamba framework, adapted from [76].

Original abstract

Structured State Space Models (SSMs) have emerged as a transformative paradigm in sequence modeling, addressing critical limitations of Recurrent Neural Networks (RNNs) and Transformers, namely, vanishing gradients, sequential computation bottlenecks, and quadratic memory complexity. By integrating structured recurrence with state-space representations, SSMs achieve linear or near-linear computational scaling while excelling in long-range dependency tasks. This study systematically traces the evolution of SSMs from the foundational Structured State Space Sequence (S4) model to modern variants like Mamba, Simplified Structured State Space Sequence (S5), and Jamba, analyzing architectural innovations that enhance computational efficiency, memory optimization, and inference speed. We critically evaluate trade-offs inherent to SSM design, such as balancing expressiveness with computational constraints and integrating hybrid architectures for domain-specific performance. Across domains including natural language processing, speech recognition, computer vision, and time-series forecasting, SSMs demonstrate state-of-the-art results in handling ultra-long sequences, outperforming Transformer-based models in both speed and memory utilization. Case studies highlight applications such as real-time speech synthesis and genomic sequence modeling, where SSMs reduce inference latency by up to 60% compared to traditional approaches. However, challenges persist in training dynamics, interpretability, and hardware-aware optimization. We conclude with a forward-looking analysis of SSMs' potential to redefine scalable deep learning, proposing directions for hybrid systems, theoretical guarantees, and broader adoption in resource-constrained environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a survey summarizing SSM evolution from S4 to Mamba with no new results or analysis.

This paper is a survey that summarizes the progression of state-space models in sequence modeling. It covers the move from S4 to Mamba and related models like S5 and Jamba, focusing on how they achieve linear scaling and handle long sequences better than transformers in some cases. What the paper does well is provide a single document that traces the evolution and lists the main innovations in each step. It also points to applications in speech, vision, and forecasting, which gives a sense of the breadth. If you need a starting point to understand why people are excited about these models, this could work.

The soft spots are more noticeable. Since it is a review, there are no new results, and the performance advantages are just restated from the original papers. The abstract talks about state-of-the-art results and specific improvements like 60% lower latency, but without independent verification or discussion of conflicting results, it is hard to know how balanced the picture is. The challenges section is brief and does not dig into why training dynamics remain difficult or how to address interpretability. The math and data sections are not applicable here because there is no original math or data; it is all descriptive. The citations seem to hit the key papers, but a stronger review would include more on the theoretical foundations or comparisons across benchmarks.

This kind of paper is for people who want an overview rather than deep technical details. It might be helpful for a reading group or for students, but it does not have enough original content to justify peer review in a top venue. I would not recommend putting it through peer review.

Referee Report

1 major / 1 minor

Summary. This manuscript is a survey tracing the evolution of Structured State Space Models (SSMs) from the S4 model through variants including S5, Mamba, and Jamba. It describes how these architectures integrate structured recurrence and state-space representations to achieve linear or near-linear scaling, address vanishing gradients and quadratic complexity issues of prior models, and improve long-range dependency modeling. The paper reviews architectural innovations for efficiency and memory, evaluates design trade-offs, surveys applications and reported results across NLP, speech, vision, and time-series domains, presents case studies (including latency reductions), and outlines challenges and future directions for hybrid systems and theoretical work.

Significance. If the synthesis of cited results is accurate and balanced, the survey could provide a useful consolidated reference for researchers tracking the shift toward SSMs for long-sequence tasks. Its value would lie in organizing the progression of ideas and highlighting efficiency claims from the literature rather than introducing new empirical findings.

major comments (1)

[Abstract] The quantitative claim that SSMs 'reduce inference latency by up to 60% compared to traditional approaches' in case studies on speech synthesis and genomic modeling is presented without any citation to the specific studies or manuscript sections containing those results. In a survey, such assertions require traceable references to allow verification of representativeness.

minor comments (1)

The abstract states that the paper 'critically evaluate[s] trade-offs' yet the provided text frames performance advantages primarily as descriptions of prior work; the main text should clarify the extent of original critical analysis versus summarization.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and recommendation of minor revision. We address the single major comment below.

Referee: [Abstract] The quantitative claim that SSMs 'reduce inference latency by up to 60% compared to traditional approaches' in case studies on speech synthesis and genomic modeling is presented without any citation to the specific studies or manuscript sections containing those results. In a survey, such assertions require traceable references to allow verification of representativeness.

Authors: We agree that the abstract's quantitative claim requires explicit traceability. The latency reduction figure is supported by the case studies and cited results discussed later in the manuscript (applications sections on speech and genomics). To ensure verifiability as a survey, we will revise the abstract to include direct citations to the relevant studies. This will be incorporated in the revised version.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

This is a survey paper that traces the evolution of SSM architectures from S4 to Mamba by summarizing and citing prior literature on performance, scaling, and applications. No new derivations, equations, predictions, or fitted parameters are introduced that could reduce to self-referential constructions. Central claims about linear scaling and outperformance are presented as descriptions of existing results rather than internally derived quantities. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results occur in a way that makes the survey's content equivalent to its inputs by definition. The paper remains self-contained as a descriptive review without load-bearing circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey paper, no new free parameters, axioms, or invented entities are introduced by the authors.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Robust Filter Attention: Self-Attention as Precision-Weighted State Estimation
cs.LG · 2025-09 · unverdicted · novelty 7.0

Robust Filter Attention models self-attention as consistency-based state estimation under a linear SDE for token trajectories, matching standard attention complexity while showing lower perplexity and better zero-shot...

RT-Transformer: The Transformer Block as a Spherical State Estimator
cs.LG · 2026-05 · unverdicted · novelty 6.0

Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.

HST-HGN: Heterogeneous Spatial-Temporal Hypergraph Networks with Bidirectional State Space Models for Global Fatigue Assessment
cs.CV · 2026-04 · unverdicted · novelty 5.0

HST-HGN uses heterogeneous spatial-temporal hypergraph networks combined with bidirectional Mamba state space models to achieve state-of-the-art driver fatigue assessment from untrimmed videos while maintaining comput...

Deep Learning for Virtual Reality User Identification: A Benchmark
cs.HC · 2026-03 · unverdicted · novelty 4.0

A benchmark study evaluates standard and emerging deep learning architectures on motion data from 71 VR users, establishing performance baselines for user identification.

Tabular Data with Class Imbalance: Predicting Electric Vehicle Crash Severity with Pretrained Transformers (TabPFN) and Mamba-Based Models
cs.LG · 2025-09 · unverdicted · novelty 4.0

Benchmarks TabPFN, MambaNet and MambaAttention on imbalanced EV crash severity classification with SMOTEENN resampling on Texas data, identifying intersection relation and speed limit as top features and MambaAttentio...

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · cited by 5 Pith papers · 13 internal anchors

[1]

Mamba-360: Survey of state space models as transformer alternative for long sequence modelling: Methods, applications, and challenges

Badri Narayana Patro and Vijay Srinivas Agneeswaran. Mamba-360: Survey of state space models as transformer alternative for long sequence modelling: Methods, applications, and challenges. arXiv preprint arXiv:2404.16112, 2024

[2]

Efficiently Modeling Long Sequences with Structured State Spaces

Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021

[3]

Stablessm: Alleviating the curse of memory in state-space models through stable reparameterization

Shida Wang and Qianxiao Li. Stablessm: Alleviating the curse of memory in state-space models through stable reparameterization. arXiv preprint arXiv:2311.14495, 2023

[4]

Efficient long sequence modeling via state space augmented transformer

Simiao Zuo, Xiaodong Liu, Jian Jiao, Denis Charles, Eren Manavoglu, Tuo Zhao, and Jianfeng Gao. Efficient long sequence modeling via state space augmented transformer. arXiv preprint arXiv:2212.08136, 2022

[5]

Applying tabular deep learning models to estimate crash injury types of young motorcyclists

Shriyank Somvanshi, Anannya Ghosh Tusti, Subasish Das, and Rohit Chakraborty. Applying tabular deep learning models to estimate crash injury types of young motorcyclists. In IEEE CAI, Santa Clara, California, USA, May 5-7 2025

[6]

Crash severity analysis of child bicyclists using arm-net and mambanet

Shriyank Somvanshi, Rohit Chakraborty, Anandi K Dutta, and Subasish Das. Crash severity analysis of child bicyclists using arm-net and mambanet. In IEEE CAI, Santa Clara, California, USA, May 5-7 2025

[7]

Mathematical formalism for memory compression in selective state space models

Siddhanth Bhat. Mathematical formalism for memory compression in selective state space models. arXiv preprint arXiv:2410.03158, 2024

[8]

Theoretical foundations of deep selective state-space models

Nicola Muca Cirone, Antonio Orvieto, Benjamin Walker, Cristopher Salvi, and Terry Lyons. Theoretical foundations of deep selective state-space models. Advances in Neural Information Processing Systems, 37:127226–127272, 2024

[9]

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060, 2024

[10]

Simplified State Space Layers for Sequence Modeling

Jimmy TH Smith, Andrew Warrington, and Scott W Linderman. Simplified state space layers for sequence modeling. arXiv preprint arXiv:2208.04933, 2022

[11]

Exploring the capability of mamba in speech applications

Koichi Miyazaki, Yoshiki Masuyama, and Masato Murata. Exploring the capability of mamba in speech applications. arXiv preprint arXiv:2406.16808, 2024

[12]

A comprehensive survey of mamba architectures for medical image analysis: Classification, segmentation, restoration and beyond

Shubhi Bansal, Sreekanth Madisetty, Mohammad Zia Ur Rehman, Chandravardhan Singh Raghaw, Gaurav Duggal, Nagendra Kumar, et al. A comprehensive survey of mamba architectures for medical image analysis: Classification, segmentation, restoration and beyond. arXiv preprint arXiv:2410.02362, 2024

[13]

Jamba-1.5: Hybrid transformer-mamba models at scale

Jamba Team, Barak Lenz, Alan Arazi, Amir Bergman, Avshalom Manevich, Barak Peleg, Ben Aviram, Chen Almagor, Clara Fridman, Dan Padnos, et al. Jamba-1.5: Hybrid transformer-mamba models at scale. arXiv preprint arXiv:2408.12570, 2024

[14]

State space model for new-generation network alternative to transformers: A survey

Xiao Wang, Shiao Wang, Yuhe Ding, Yuehang Li, Wentao Wu, Yao Rong, Weizhe Kong, Ju Huang, Shihao Li, Haoxiang Yang, et al. State space model for new-generation network alternative to transformers: A survey. arXiv preprint arXiv:2404.09516, 2024

J. ACM, Vol. , No. , Article . Publication date: March 2025.

[15]

Analysis and control of nonlinear process systems

Katalin M Hangos, József Bokor, and Gábor Szederkényi. Analysis and control of nonlinear process systems. Springer Science & Business Media, 2006

[16]

A new approach to linear filtering and prediction problems

Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering, 82(1):35–45, 1960

[17]

Time series analysis

James D Hamilton. Time series analysis. Princeton university press, 2020

[18]

Linear systems, volume 156

Thomas Kailath. Linear systems, volume 156. Prentice-Hall Englewood Cliffs, NJ, 1980

[19]

Combining recurrent, convolutional, and continuous-time models with linear state space layers

Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in neural information processing systems, 34:572–585, 2021

[20]

Finding structure in time

Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990

[21]

Serial order: A parallel distributed processing approach

Michael I Jordan. Serial order: A parallel distributed processing approach. In Advances in psychology, volume 121, pages 471–495. Elsevier, 1997

[22]

Long short-term memory

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997

[23]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

[24]

Backpropagation applied to handwritten zip code recognition

Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989

[25]

Gradient-based learning applied to document recognition

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998

[26]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

[27]

A comparison of lstm and gru networks for learning symbolic sequences

Roberto Cahuantzi, Xinye Chen, and Stefan Güttel. A comparison of lstm and gru networks for learning symbolic sequences. In Science and Information Conference, pages 771–785. Springer, 2023

[28]

Using cnn for facial expression recognition: a study of the effects of kernel size and number of filters on accuracy

Abhinav Agrawal and Namita Mittal. Using cnn for facial expression recognition: a study of the effects of kernel size and number of filters on accuracy. The Visual Computer, 36(2):405–412, 2020

[29]

Theoretical foundations of deep selective state-space models

Nicola Muca Cirone, Antonio Orvieto, Benjamin Walker, Cristopher Salvi, and Terry Lyons. Theoretical foundations of deep selective state-space models. arXiv preprint arXiv:2402.19047, 2024

[30]

Dygmamba: Efficiently modeling long-term temporal dependency on continuous-time dynamic graphs with state space models

Zifeng Ding, Yifeng Li, Yuan He, Antonio Norelli, Jingcheng Wu, Volker Tresp, Yunpu Ma, and Michael Bronstein. Dygmamba: Efficiently modeling long-term temporal dependency on continuous-time dynamic graphs with state space models. arXiv preprint arXiv:2408.04713, 2024

[31]

Learning long-term dependencies with gradient descent is difficult

Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994

[32]

How to train your hippo: State space models with generalized orthogonal basis projections

Albert Gu, Isys Johnson, Aman Timalsina, Atri Rudra, and Christopher Ré. How to train your hippo: State space models with generalized orthogonal basis projections. arXiv preprint arXiv:2206.12037, 2022

[33]

System identification: Theory for the user

Lennart Ljung. System identification: Theory for the user. 1987

[34]

Neural ordinary differential equations

Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. Advances in neural information processing systems, 31, 2018

[35]

Hippo: Recurrent memory with optimal polynomial projections

Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. Hippo: Recurrent memory with optimal polynomial projections. Advances in neural information processing systems, 33:1474–1487, 2020

[36]

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014

[37]

Transformers are rnns: Fast autoregressive transformers with linear attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020

[38]

Efficient transformers: A survey

Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. ACM Computing Surveys, 55(6):1–28, 2022

[39]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

[40]

Frozen in time: A joint video and image encoder for end-to-end retrieval

Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1728–1738, 2021

[41]

Temporal fusion transformers for interpretable multi-horizon time series forecasting

Bryan Lim, Sercan Ö Arık, Nicolas Loeff, and Tomas Pfister. Temporal fusion transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting, 37(4):1748–1764, 2021

[42]

Speech recognition with deep recurrent neural networks

Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pages 6645–6649. IEEE, 2013

[43]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

[44]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

[45]

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019

[46]

Linformer: Self-Attention with Linear Complexity

Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020

[47]

Modeling multivariate biosignals with graph neural networks and structured state space models

Siyi Tang, Jared A Dunnmon, Qu Liangqiong, Khaled K Saab, Tina Baykaner, Christopher Lee-Messer, and Daniel L Rubin. Modeling multivariate biosignals with graph neural networks and structured state space models. In Conference on health, inference, and learning, pages 50–71. PMLR, 2023

[48]

Spectral Normalization for Generative Adversarial Networks

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018

[49]

Convolutional state space models for long-range spatiotemporal modeling

Jimmy Smith, Shalini De Mello, Jan Kautz, Scott Linderman, and Wonmin Byeon. Convolutional state space models for long-range spatiotemporal modeling. Advances in Neural Information Processing Systems, 36:80690–80729, 2023

[50]

Highly accurate protein structure prediction with alphafold

John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021

[51]

Chain of agents: Large language models collaborating on long-context tasks

Yusen Zhang, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, and Sercan Arik. Chain of agents: Large language models collaborating on long-context tasks. Advances in Neural Information Processing Systems, 37:132208–132237, 2024

[52]

Gpipe: Efficient training of giant neural networks using pipeline parallelism

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019

[53]

Robust speech recognition via large-scale weak supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492–28518. PMLR, 2023

[54]

Long range arena: A benchmark for efficient transformers

Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long range arena: A benchmark for efficient transformers. arXiv preprint arXiv:2011.04006, 2020

[55]

Spectral normalisation for deep reinforcement learning: an optimisation perspective

Florin Gogianu, Tudor Berariu, Mihaela C Rosca, Claudia Clopath, Lucian Busoniu, and Razvan Pascanu. Spectral normalisation for deep reinforcement learning: an optimisation perspective. In International Conference on Machine Learning, pages 3734–3744. PMLR, 2021

[56]

Graph-mamba: Towards long-range graph sequence modeling with selective state spaces

Chloe Wang, Oleksii Tsepa, Jun Ma, and Bo Wang. Graph-mamba: Towards long-range graph sequence modeling with selective state spaces. arXiv preprint arXiv:2402.00789, 2024

[57]

Computation-efficient era: A comprehensive survey of state space models in medical image analysis

Moein Heidari, Sina Ghorbani Kolahi, Sanaz Karimijafarbigloo, Bobby Azad, Afshin Bozorgpour, Soheila Hatami, Reza Azad, Ali Diba, Ulas Bagci, Dorit Merhof, et al. Computation-efficient era: A comprehensive survey of state space models in medical image analysis. arXiv preprint arXiv:2406.03430, 2024

[58]

A survey on visual mamba

Hanwei Zhang, Ying Zhu, Dan Wang, Lijun Zhang, Tianxiang Chen, Ziyang Wang, and Zi Ye. A survey on visual mamba. Applied Sciences, 14(13):5683, 2024

[59]

The hidden attention of mamba models

Ameen Ali, Itamar Zimerman, and Lior Wolf. The hidden attention of mamba models. arXiv preprint arXiv:2403.01590, 2024

[60]

Mambats: improved selective state space models for long-term time series forecasting

Xiuding Cai, Yaoyao Zhu, Xueyao Wang, and Yu Yao. Mambats: improved selective state space models for long-term time series forecasting. arXiv preprint arXiv:2405.16440, 2024

[61]

Coupled mamba: Enhanced multi-modal fusion with coupled state space model

Wenbing Li, Hang Zhou, Junqing Yu, Zikai Song, and Wei Yang. Coupled mamba: Enhanced multi-modal fusion with coupled state space model. arXiv preprint arXiv:2405.18014, 2024

[62]

Vl-mamba: Exploring state space models for multimodal learning

Yanyuan Qiao, Zheng Yu, Longteng Guo, Sihan Chen, Zijia Zhao, Mingzhen Sun, Qi Wu, and Jing Liu. Vl-mamba: Exploring state space models for multimodal learning. arXiv preprint arXiv:2403.13600, 2024

[63]

Kalmamba: Towards efficient probabilistic state space models for rl under uncertainty

Philipp Becker, Niklas Freymuth, and Gerhard Neumann. Kalmamba: Towards efficient probabilistic state space models for rl under uncertainty. arXiv preprint arXiv:2406.15131, 2024

[64]

Simba: Simplified mamba-based architecture for vision and multivariate time series

Badri N Patro and Vijay S Agneeswaran. Simba: Simplified mamba-based architecture for vision and multivariate time series. arXiv preprint arXiv:2403.15360, 2024

[65]

Jamba: A Hybrid Transformer-Mamba Language Model

Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-mamba language model. arXiv preprint arXiv:2403.19887, 2024

[66]

Multilingual state space models for structured question answering in indic languages

Arpita Vats, Rahul Raja, Mrinal Mathur, Vinija Jain, and Aman Chadha. Multilingual state space models for structured question answering in indic languages. arXiv preprint arXiv:2502.01673, 2025

[67]

Zamba: A compact 7B SSM hybrid model

Paolo Glorioso, Minghan He, Yehonatan Rozen, Alex Kuefler, Omer Lieber, Brendan Millidge, Peter Battaglia, Aran Komatsuzaki, Aäron van den Oord, Alex Graves, et al. Zamba: A compact 7b ssm hybrid model. arXiv preprint arXiv:2405.16712, 2024

[68]

S4nd: Modeling images and videos as multidimensional signals with state spaces

Eric Nguyen, Karan Goel, Albert Gu, Gordon Downs, Preey Shah, Tri Dao, Stephen Baccus, and Christopher Ré. S4nd: Modeling images and videos as multidimensional signals with state spaces. Advances in neural information processing systems, 35:2846–2861, 2022

[69]

Diagonal state spaces are as effective as structured state spaces

Ankit Gupta, Albert Gu, and Jonathan Berant. Diagonal state spaces are as effective as structured state spaces. Advances in Neural Information Processing Systems, 35:22982–22994, 2022

[70]

Liquid structural state-space models

Ramin Hasani, Mathias Lechner, Tsun-Hsuan Wang, Makram Chahine, Alexander Amini, and Daniela Rus. Liquid structural state-space models. arXiv preprint arXiv:2209.12951, 2022

[71]

Hyena hierarchy: Towards larger convolutional language models

Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. Hyena hierarchy: Towards larger convolutional language models. In International Conference on Machine Learning, pages 28043–28078. PMLR, 2023

[72]

On the parameterization and initialization of diagonal state space models

Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré. On the parameterization and initialization of diagonal state space models. Advances in Neural Information Processing Systems, 35:35971–35983, 2022

[73]

Mega: Moving average equipped gated attention

Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer. Mega: Moving average equipped gated attention. arXiv preprint arXiv:2209.10655, 2022

[74]

RWKV: Reinventing RNNs for the Transformer Era

Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, et al. Rwkv: Reinventing rnns for the transformer era. arXiv preprint arXiv:2305.13048, 2023

[75]

J. Ma, F. Li, and B. Wang. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722, 2024

[76]

Fusionmamba: Dynamic feature enhancement for multimodal image fusion with mamba

Xinyu Xie, Yawen Cui, Tao Tan, Xubin Zheng, and Zitong Yu. Fusionmamba: Dynamic feature enhancement for multimodal image fusion with mamba. Visual Intelligence, 2(1):37, 2024

[77]

A survey of rwkv

Zhiyuan Li, Tingyu Xia, Yi Chang, and Yuan Wu. A survey of rwkv. arXiv preprint arXiv:2412.14847, 2024

[78]

Linear recurrent units for sequential recommendation

Zhenrui Yue, Yueqi Wang, Zhankui He, Huimin Zeng, Julian McAuley, and Dong Wang. Linear recurrent units for sequential recommendation. In Proceedings of the 17th ACM international conference on web search and data mining, pages 930–938, 2024

[79]

Soft hierarchical graph recurrent networks for many-agent partially observable environments

Zhenhui Ye, Xiaohong Jiang, Guanghua Song, and Bowei Yang. Soft hierarchical graph recurrent networks for many-agent partially observable environments. arXiv preprint arXiv:2109.02032, 2021

[80]

Soft-hgrns: soft hierarchical graph recurrent networks for multi-agent partially observable environments

Yixiang Ren, Zhenhui Ye, Yining Chen, Xiaohong Jiang, and Guanghua Song. Soft-hgrns: soft hierarchical graph recurrent networks for multi-agent partially observable environments. Frontiers of Information Technology & Electronic Engineering, 24(1):117–130, 2023

