Advancing Intelligent Sequence Modeling: Evolution, Trade-offs, and Applications of State-Space Architectures from S4 to Mamba
Pith reviewed 2026-05-22 22:03 UTC · model grok-4.3
The pith
Structured state space models achieve linear scaling and outperform transformers on long-range sequence tasks across domains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By combining structured recurrence with state-space representations, SSMs deliver linear or near-linear computational scaling while capturing long-range dependencies more effectively than RNNs or transformers. The survey reports advantages in inference speed and memory use from S4 through Mamba across NLP, speech, vision, and time-series applications.
What carries the argument
The structured state space sequence (S4) model and its selective extensions such as Mamba, which replace dense attention with parameterized linear state transitions so that inference carries a fixed-size state and compute grows linearly with sequence length.
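As a rough illustration of what "parameterized linear state transitions" means, the sketch below runs a tiny discrete linear state space layer in two equivalent ways: as a step-by-step recurrence that carries only a fixed-size state, and as a causal convolution with a precomputed kernel. It assumes a diagonal transition for simplicity, which is not the exact parameterization used by S4 or Mamba, and every name in it (`ssm_scan`, `ssm_kernel`, `causal_conv`) is hypothetical.

```python
from typing import List


def ssm_scan(a: List[float], b: List[float], c: List[float], u: List[float]) -> List[float]:
    """Recurrent view: x_k = a * x_{k-1} + b * u_k (elementwise), y_k = sum(c * x_k).

    The state has len(a) entries no matter how long the input u is.
    """
    state = [0.0] * len(a)
    outputs = []
    for u_k in u:
        state = [a_i * x_i + b_i * u_k for a_i, x_i, b_i in zip(a, state, b)]
        outputs.append(sum(c_i * x_i for c_i, x_i in zip(c, state)))
    return outputs


def ssm_kernel(a: List[float], b: List[float], c: List[float], length: int) -> List[float]:
    """Convolutional view: kernel[j] = sum_i c_i * a_i**j * b_i."""
    return [
        sum(c_i * (a_i ** j) * b_i for a_i, b_i, c_i in zip(a, b, c))
        for j in range(length)
    ]


def causal_conv(kernel: List[float], u: List[float]) -> List[float]:
    """Naive causal convolution: y_k = sum_{j<=k} kernel[j] * u[k-j]."""
    return [sum(kernel[j] * u[k - j] for j in range(k + 1)) for k in range(len(u))]


if __name__ == "__main__":
    # Placeholder parameters chosen only for illustration (assumption).
    a, b, c = [0.9, 0.5], [1.0, 1.0], [0.5, -0.25]
    u = [1.0, 0.0, 2.0, -1.0]
    recurrent = ssm_scan(a, b, c, u)
    convolutional = causal_conv(ssm_kernel(a, b, c, len(u)), u)
    # Both views should agree up to floating-point error.
    print(all(abs(p - q) < 1e-9 for p, q in zip(recurrent, convolutional)))
```

The recurrent form is the one that keeps memory fixed per step at inference. The naive convolution here is quadratic and only serves to show the equivalence; practical implementations use faster kernels or scans.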
If this is right
- Ultra-long sequences become feasible without quadratic memory growth, enabling direct modeling of entire documents or genomes.
- Inference latency drops measurably, with reported reductions reaching 60 percent in real-time speech synthesis.
- Hybrid SSM-transformer designs allow domain-specific tuning while preserving linear scaling in the dominant layers.
- Resource-constrained settings gain access to state-of-the-art sequence performance previously limited to large GPU clusters.
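The first two bullets can be made concrete with some back-of-the-envelope arithmetic. The sketch below compares the size of one naive dense attention score matrix with a fixed recurrent state, and converts a fractional latency reduction into a speedup factor. The function names and the state and channel sizes are hypothetical placeholders, and the attention count assumes the full score matrix is materialized, which optimized kernels avoid.

```python
def attention_score_entries(seq_len: int) -> int:
    """Entries in one naive dense attention score matrix (seq_len x seq_len)."""
    return seq_len * seq_len


def recurrent_state_entries(state_dim: int, channels: int) -> int:
    """Entries in a recurrent SSM state; independent of sequence length."""
    return state_dim * channels


def speedup_from_latency_reduction(reduction: float) -> float:
    """Convert a fractional latency reduction (0.60 means 60 percent) to a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    # state_dim=16 and channels=64 are illustrative values, not taken from the paper.
    for seq_len in (1_000, 10_000, 100_000):
        print(seq_len, attention_score_entries(seq_len), recurrent_state_entries(16, 64))
    print(round(speedup_from_latency_reduction(0.60), 2))
```

Under these assumptions the attention count grows with the square of the sequence length while the state count stays fixed, and a 60 percent latency reduction corresponds to roughly a 2.5x speedup.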
Where Pith is reading between the lines
- Continued hardware-aware redesign of the state transition matrices could widen the efficiency gap further on edge devices.
- Interpretability tools developed for SSMs may transfer to other linear-time recurrent architectures.
- If training instability issues are resolved, SSMs could replace transformers as the backbone for very long context windows in production systems.
Load-bearing premise
Performance advantages reported for SSMs in the surveyed literature represent their typical behavior rather than results selected from favorable tasks or implementations.
What would settle it
A single large-scale, domain-balanced benchmark in which transformer variants match or beat SSM variants on both accuracy and wall-clock efficiency for sequences longer than 10,000 tokens.
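One way to read this criterion as a decision rule is sketched below. It assumes a hypothetical record type `BenchmarkRow` with one row per domain and sequence length, and treats the claim as challenged only if transformers match or beat SSMs on both accuracy and wall-clock time in every row above the length threshold. The field names, the threshold default, and the strictness of "every row" are assumptions for illustration, and the sample values are placeholders rather than measured results.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BenchmarkRow:
    """Hypothetical record for one domain at one sequence length."""
    domain: str
    seq_len: int
    transformer_accuracy: float
    ssm_accuracy: float
    transformer_seconds: float
    ssm_seconds: float


def transformer_matches_or_beats(row: BenchmarkRow) -> bool:
    """True when the transformer is at least as accurate and at least as fast."""
    return (
        row.transformer_accuracy >= row.ssm_accuracy
        and row.transformer_seconds <= row.ssm_seconds
    )


def claim_is_challenged(rows: Iterable[BenchmarkRow], min_len: int = 10_000) -> bool:
    """True when every long-sequence row favors (or ties for) the transformer."""
    long_rows = [r for r in rows if r.seq_len > min_len]
    return bool(long_rows) and all(transformer_matches_or_beats(r) for r in long_rows)


if __name__ == "__main__":
    # Placeholder numbers for illustration only; not results from any study.
    rows = [
        BenchmarkRow("nlp", 16_000, 0.80, 0.80, 1.0, 1.2),
        BenchmarkRow("speech", 32_000, 0.70, 0.75, 2.0, 1.0),
    ]
    print(claim_is_challenged(rows))
```

A looser rule, for example a majority of domains instead of all of them, would be just as defensible; the source does not say which is intended.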
Original abstract
Structured State Space Models (SSMs) have emerged as a transformative paradigm in sequence modeling, addressing critical limitations of Recurrent Neural Networks (RNNs) and Transformers, namely, vanishing gradients, sequential computation bottlenecks, and quadratic memory complexity. By integrating structured recurrence with state-space representations, SSMs achieve linear or near-linear computational scaling while excelling in long-range dependency tasks. This study systematically traces the evolution of SSMs from the foundational Structured State Space Sequence (S4) model to modern variants like Mamba, Simplified Structured State Space Sequence (S5), and Jamba, analyzing architectural innovations that enhance computational efficiency, memory optimization, and inference speed. We critically evaluate trade-offs inherent to SSM design, such as balancing expressiveness with computational constraints and integrating hybrid architectures for domain-specific performance. Across domains including natural language processing, speech recognition, computer vision, and time-series forecasting, SSMs demonstrate state-of-the-art results in handling ultra-long sequences, outperforming Transformer-based models in both speed and memory utilization. Case studies highlight applications such as real-time speech synthesis and genomic sequence modeling, where SSMs reduce inference latency by up to 60% compared to traditional approaches. However, challenges persist in training dynamics, interpretability, and hardware-aware optimization. We conclude with a forward-looking analysis of SSMs' potential to redefine scalable deep learning, proposing directions for hybrid systems, theoretical guarantees, and broader adoption in resource-constrained environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This manuscript is a survey tracing the evolution of Structured State Space Models (SSMs) from the S4 model through variants including S5, Mamba, and Jamba. It describes how these architectures integrate structured recurrence and state-space representations to achieve linear or near-linear scaling, address vanishing gradients and quadratic complexity issues of prior models, and improve long-range dependency modeling. The paper reviews architectural innovations for efficiency and memory, evaluates design trade-offs, surveys applications and reported results across NLP, speech, vision, and time-series domains, presents case studies (including latency reductions), and outlines challenges and future directions for hybrid systems and theoretical work.
Significance. If the synthesis of cited results is accurate and balanced, the survey could provide a useful consolidated reference for researchers tracking the shift toward SSMs for long-sequence tasks. Its value would lie in organizing the progression of ideas and highlighting efficiency claims from the literature rather than introducing new empirical findings.
major comments (1)
- [Abstract] Abstract: The quantitative claim that SSMs 'reduce inference latency by up to 60% compared to traditional approaches' in case studies on speech synthesis and genomic modeling is presented without any citation to the specific studies or manuscript sections containing those results. In a survey, such assertions require traceable references to allow verification of representativeness.
minor comments (1)
- The abstract states that the paper 'critically evaluate[s] trade-offs' yet the provided text frames performance advantages primarily as descriptions of prior work; the main text should clarify the extent of original critical analysis versus summarization.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and recommendation of minor revision. We address the single major comment below.
- Referee: [Abstract] Abstract: The quantitative claim that SSMs 'reduce inference latency by up to 60% compared to traditional approaches' in case studies on speech synthesis and genomic modeling is presented without any citation to the specific studies or manuscript sections containing those results. In a survey, such assertions require traceable references to allow verification of representativeness.
Authors: We agree that the abstract's quantitative claim requires explicit traceability. The latency reduction figure is supported by the case studies and cited results discussed later in the manuscript (applications sections on speech and genomics). To ensure verifiability as a survey, we will revise the abstract to include direct citations to the relevant studies. This will be incorporated in the revised version.
Revision: yes
Circularity Check
No significant circularity detected
This is a survey paper that traces the evolution of SSM architectures from S4 to Mamba by summarizing and citing prior literature on performance, scaling, and applications. No new derivations, equations, predictions, or fitted parameters are introduced that could reduce to self-referential constructions. Central claims about linear scaling and outperformance are presented as descriptions of existing results rather than internally derived quantities. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results occur in a way that makes the survey's content equivalent to its inputs by definition. The paper remains self-contained as a descriptive review without load-bearing circular reductions.
Forward citations
Cited by 5 Pith papers
- Robust Filter Attention: Self-Attention as Precision-Weighted State Estimation
Robust Filter Attention models self-attention as consistency-based state estimation under a linear SDE for token trajectories, matching standard attention complexity while showing lower perplexity and better zero-shot...
- RT-Transformer: The Transformer Block as a Spherical State Estimator
Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.
- HST-HGN: Heterogeneous Spatial-Temporal Hypergraph Networks with Bidirectional State Space Models for Global Fatigue Assessment
HST-HGN uses heterogeneous spatial-temporal hypergraph networks combined with bidirectional Mamba state space models to achieve state-of-the-art driver fatigue assessment from untrimmed videos while maintaining comput...
- Deep Learning for Virtual Reality User Identification: A Benchmark
A benchmark study evaluates standard and emerging deep learning architectures on motion data from 71 VR users, establishing performance baselines for user identification.
- Tabular Data with Class Imbalance: Predicting Electric Vehicle Crash Severity with Pretrained Transformers (TabPFN) and Mamba-Based Models
Benchmarks TabPFN, MambaNet and MambaAttention on imbalanced EV crash severity classification with SMOTEENN resampling on Texas data, identifying intersection relation and speed limit as top features and MambaAttentio...