Efficiently Modeling Long Sequences with Structured State Spaces

Albert Gu, Karan Goel, Christopher Ré · 2021 · cs.LG · arXiv 2111.00396

Canonical reference. 78% of citing Pith papers cite this work as background.

156 Pith papers citing it

abstract

A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on long-range dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of $10000$ or more steps. A promising recent approach proposed modeling sequences by simulating the fundamental state space model (SSM) $x'(t) = Ax(t) + Bu(t),\ y(t) = Cx(t) + Du(t)$, and showed that for appropriate choices of the state matrix $A$, this system could handle long-range dependencies mathematically and empirically. However, this method has prohibitive computation and memory requirements, rendering it infeasible as a general sequence modeling solution. We propose the Structured State Space sequence model (S4) based on a new parameterization for the SSM, and show that it can be computed much more efficiently than prior approaches while preserving their theoretical strengths. Our technique involves conditioning $A$ with a low-rank correction, allowing it to be diagonalized stably and reducing the SSM to the well-studied computation of a Cauchy kernel. S4 achieves strong empirical results across a diverse range of established benchmarks, including (i) 91% accuracy on sequential CIFAR-10 with no data augmentation or auxiliary losses, on par with a larger 2-D ResNet, (ii) substantially closing the gap to Transformers on image and language modeling tasks, while performing generation $60\times$ faster, and (iii) SoTA on every task from the Long Range Arena benchmark, including solving the challenging Path-X task of length 16k that all prior work fails on, while being as efficient as all competitors.
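The state space model in the abstract can be made concrete with a small sketch. The code below is an illustration only, not the S4 algorithm: it assumes a tiny diagonal $A$ (so no matrix inverse is needed), uses the bilinear discretization, and computes the output two ways, as a step-by-step recurrence and as a causal convolution with the kernel $K_j = C\bar{A}^j\bar{B}$. All function names and parameter values are hypothetical, and the kernel is built naively rather than through the Cauchy-kernel reduction the abstract describes.

```python
# Illustrative sketch only: a tiny diagonal SSM, not the S4 parameterization.

def discretize_bilinear(a, b, step):
    """Bilinear discretization of x'(t) = A x(t) + B u(t) for diagonal A.

    a, b: lists holding the diagonal of A and the entries of B.
    Returns the diagonal of A_bar and the entries of B_bar.
    """
    a_bar = [(1 + step / 2 * ai) / (1 - step / 2 * ai) for ai in a]
    b_bar = [step * bi / (1 - step / 2 * ai) for ai, bi in zip(a, b)]
    return a_bar, b_bar


def run_recurrence(a_bar, b_bar, c, d, u):
    """Step the discrete SSM: x_k = A_bar x_{k-1} + B_bar u_k, y_k = C x_k + D u_k."""
    x = [0.0] * len(a_bar)  # zero initial state
    ys = []
    for uk in u:
        x = [ai * xi + bi * uk for ai, bi, xi in zip(a_bar, b_bar, x)]
        ys.append(sum(ci * xi for ci, xi in zip(c, x)) + d * uk)
    return ys


def ssm_kernel(a_bar, b_bar, c, length):
    """Naive convolution kernel K_j = C A_bar^j B_bar for j = 0..length-1."""
    return [sum(ci * ai ** j * bi for ci, ai, bi in zip(c, a_bar, b_bar))
            for j in range(length)]


def run_convolution(kernel, d, u):
    """Causal convolution y_k = sum_j K_j u_{k-j} + D u_k."""
    return [sum(kernel[j] * u[k - j] for j in range(k + 1)) + d * u[k]
            for k in range(len(u))]


# Hypothetical parameters, chosen only for illustration.
A_DIAG = [-0.5, -2.0]
B = [1.0, 1.0]
C = [0.3, -0.7]
D = 0.1
STEP = 0.1
u = [1.0, 0.0, -1.0, 2.0, 0.5]

a_bar, b_bar = discretize_bilinear(A_DIAG, B, STEP)
y_rec = run_recurrence(a_bar, b_bar, C, D, u)
y_conv = run_convolution(ssm_kernel(a_bar, b_bar, C, len(u)), D, u)
```

Under these assumptions the two outputs agree up to floating-point error. The recurrent form needs only the current state at each step, while the convolutional form exposes the whole sequence at once; the direct kernel computation shown here becomes costly for large state sizes and long sequences.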

citation-role summary

background 21 · method 5 · baseline 1

citation-polarity summary

background 21 · use method 5 · baseline 1

authors

Albert Gu, Christopher Ré, Karan Goel

representative citing papers

Rotation Equivariant Mamba for Vision Tasks

cs.CV · 2026-03-10 · unverdicted · novelty 8.0

EQ-VMamba adds rotation-equivariant cross-scan and group Mamba blocks to enforce end-to-end rotation equivariance, yielding better rotation robustness, competitive accuracy, and roughly 50% fewer parameters than non-equivariant baselines across classification, segmentation, and super-resolution.

Test-Time Training with KV Binding Is Secretly Linear Attention

cs.LG · 2026-02-24 · conditional · novelty 8.0

Test-time training with KV binding reduces to learned linear attention.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Set Diffusion: Interpolating Token Orderings Between Autoregression and Diffusion for Fast and Flexible Decoding

cs.LG · 2026-07-02 · unverdicted · novelty 7.0

Set diffusion factorizes likelihood over arbitrary token sets and uses a set-causal diffusion architecture to support KV caching and any-order decoding, yielding improved speed-quality tradeoffs versus prior diffusion LMs.

MASS: Motion-Aligned Selective Scan for Refinement in Flow-Based Video Frame Interpolation

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

MASS reformulates SSM-based feature scanning in flow-based VFI to follow dynamic motion trajectories via learnable path integration and velocity-aware sampling, claiming SOTA on challenging large-displacement cases.

SSM Adapters via Hankel Reduced-order Modeling: Injection Site Determines Task Suitability in Long-Context Fine-Tuning

cs.LG · 2026-06-24 · unverdicted · novelty 7.0

HRM adapters via Hankel reduced-order modeling outperform LoRA on long-context tasks in Mistral-7B when used as SSM residual modules with FFT-based parallel scan.

Frequency Domain Reservoir Computing

cs.LG · 2026-06-23 · unverdicted · novelty 7.0

FRESCO is a frequency-domain Echo State Network using zero-padding embeddings, packed readout, and native frequency non-linearity to achieve O(N) complexity while matching SOTA on memory and forecasting benchmarks.

LongSpike: Fractional Order Spiking State Space Models for Efficient Long Sequence Learning

cs.LG · 2026-06-11 · unverdicted · novelty 7.0

LongSpike integrates fractional-order state-space modeling into spiking neural networks, enabling better long-sequence performance than prior SNNs on LRA, WikiText-103, and Speech Commands benchmarks while retaining sparse computation.

Between Amnesia and Chaos: A Memory Stability Expressivity Trilemma for Trainable Dissipative Oscillator Networks

cs.LG · 2026-06-07 · unverdicted · novelty 7.0

Trainable dissipative oscillator networks exhibit a trilemma in which damping governs memory horizon, gradient stability, and Lyapunov exponent, with learned substrates outperforming frozen ones only at short horizons before the advantage closes near eleven steps.

MOSAIC: A Workload-Driven Simulation and Design-Space Exploration Framework for Heterogeneous NPUs

cs.AR · 2026-06-03 · unverdicted · novelty 7.0

MOSAIC is a simulation and DSE framework for heterogeneous NPUs that finds designs achieving 46.91% mean iso-area energy savings over homogeneous baselines on 20 workloads.

Learning Long Range Spatio-Temporal Representations over Continuous Time Dynamic Graphs with State Space Models

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

CTDG-SSM introduces CTT-HiPPO, a Laplacian-polynomial projection of HiPPO, to create a parameter-efficient state-space formulation for continuous-time dynamic graphs that captures long-range spatio-temporal patterns.

AURA: Action-Gated Memory for Robot Policies at Constant VRAM

cs.AI · 2026-06-01 · unverdicted · novelty 7.0

AURA-Mem uses an action-gated recurrent memory trained on closed-loop action error to deliver constant 4,224-byte state and 5-9x fewer writes than baselines while matching base policy success on LIBERO-Long.

Trading Complexity for Expressivity Through Structured Generalized Linear Token Mixing

cs.LG · 2026-05-29 · unverdicted · novelty 7.0

Presents a structured generalized linear token mixing framework that extends recurrence equations to multiple past states, enabling new patterns with provable complexity-expressivity trade-offs for causal generation.

UWM-JEPA: Predictive World Models That Imagine in Belief Space

cs.LG · 2026-05-25 · unverdicted · novelty 7.0

UWM-JEPA uses a density-matrix latent and unitary predictor in JEPA to preserve joint-state spectrum during blind rollouts, achieving 0.77 accuracy on a five-step hidden-velocity task versus 0.53 for an LSTM baseline.

Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation

cs.CV · 2026-05-24 · unverdicted · novelty 7.0

MVCHead uses a hierarchical state space model with bi-directional scans and an SE(3) critic to enforce 3D consistency in Gaussian avatars trained only on 2D images.

Exact expression for maximum Lyapunov exponent during transients in computationally powerful dynamical networks

nlin.CD · 2026-05-20 · unverdicted · novelty 7.0

Exact analytical expression for the time-dependent maximum Lyapunov exponent during transients in a network supporting dynamics-based computation.

Social-Mamba: Socially-Aware Trajectory Forecasting with State-Space Models

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

Social-Mamba introduces a Cycle Mamba block and social triplet factorization to achieve state-of-the-art trajectory forecasting accuracy with linear-time social interaction modeling on five benchmarks.

A Novel Schur-Decomposition-Based Weight Projection Method for Stable State-Space Neural-Network Architectures

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

A real Schur decomposition projection maps the state matrix of discrete-time state-space layers onto its nearest stable counterpart, delivering accuracy comparable to prior stable identification methods with fewer weights.

QLAM: A Quantum Long-Attention Memory Approach to Long-Sequence Token Modeling

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

QLAM extends state-space models with quantum superposition in the hidden state for linear-time long-sequence modeling and reports consistent gains over RNN and transformer baselines on sequential image tasks.

Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo

cond-mat.str-el · 2026-05-13 · conditional · novelty 7.0

PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.

Selection, Not Fusion: Radar-Modulated State Space Models for Radar-Camera Depth Estimation

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

Radar-Modulated Selection perturbs only the step size Δ and readout C parameters inside Mamba's selective scan with radar data while keeping other components image-only, yielding state-of-the-art depth estimation on nuScenes with up to 34% MAE reduction.

TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.

TIDES: Implicit Time-Awareness in Selective State Space Models

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

TIDES reconciles selective SSM expressivity with continuous-time physical discretization by moving input dependence onto the state matrix, enabling native irregular time series handling and achieving SOTA on UEA and Physiome-ODE benchmarks.

PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

cs.LG · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

PairAlign learns compact variable-length token sequences for audio via self-alignment on paired content-preserving views, achieving 55% fewer archive tokens than VQ while preserving edit-distance retrieval at 12.71 tokens/s.

citing papers explorer

Showing 22 of 22 citing papers after filters.

Kinetic-Mamba: Mamba-Assisted Predictions of Stiff Chemical Kinetics

cs.LG · 2025-12-16 · unverdicted · none · ref 23 · internal anchor

Mamba-based neural operators predict stiff chemical kinetics evolution with high fidelity from initial states on Syngas and GRI-Mech 3.0 mechanisms.

L2RU: a Structured State Space Model with prescribed L2-bound

eess.SY · 2025-03-31 · unverdicted · none · ref 9 · internal anchor

L2RU parametrizes SSMs to enforce a prescribed L2-gain bound for guaranteed input-output stability and robustness in all parameter regimes.

Mamba-Based Graph Convolutional Networks: Tackling Over-smoothing with Selective State Space

cs.LG · 2025-01-26 · unverdicted · none · ref 10 · internal anchor

MbaGCN combines message aggregation, selective state space transitions, and node state prediction to create a more scalable deep graph convolutional network.

Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression

cs.LG · 2025-11-26 · unverdicted · none · ref 21 · internal anchor

Gated KalmaNet uses exact Kalman gain computation with adaptive gating and Chebyshev iteration to improve SSM performance on long-context tasks over prior approximations like DeltaNet.

Higher-order Linear Attention

cs.LG · 2025-10-31 · unverdicted · none · ref 6 · internal anchor

Higher-order Linear Attention realizes second-order and higher interactions in linear-time causal attention via constant-size state and associative scans.

Kimi Linear: An Expressive, Efficient Attention Architecture

cs.CL · 2025-10-30 · unverdicted · none · ref 31 · internal anchor

Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

Time-Scale Coupling Between States and Parameters in Recurrent Neural Networks

cs.LG · 2025-08-16 · unverdicted · none · ref 4 · internal anchor

Gating in RNNs couples state time-scales with parameter gradients to produce lag- and direction-dependent effective learning rates, shown via exact Jacobians and first-order expansion.

MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

cs.CL · 2025-07-03 · unverdicted · none · ref 36 · internal anchor

MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.

CodeBrain: Bridging Decoupled Tokenizer and Multi-Scale Architecture for EEG Foundation Model

cs.LG · 2025-06-10 · unverdicted · none · ref 36 · internal anchor

CodeBrain introduces a decoupled TFDual-Tokenizer and multi-scale EEGSSM architecture for an EEG foundation model pretrained on a large corpus, claiming strong generalization across eight downstream tasks and ten datasets.

Quantitative Error Feedback for Quantization Noise Reduction of Filtering over Graphs

cs.LG · 2025-06-02 · unverdicted · none · ref 53 · internal anchor

Introduces quantitative error feedback from digital filter techniques to exactly compensate quantization noise in graph filtering, with closed-form optimal coefficients for deterministic, random-graph, and asynchronous scenarios.

Fine-Grained Fusion: The Missing Piece in Area-Efficient State Space Model Acceleration

cs.AR · 2025-04-24 · unverdicted · none · ref 5 · 2 links · internal anchor

Fine-grained fusion and adaptive scheduling in SSMs deliver up to 4.8x speedup and 10x lower on-chip memory, enabling a fusion-aware accelerator with 1.78x higher performance than MARCA at equal area.

Upper Approximation Bounds for Neural Oscillators

cs.LG · 2025-11-30 · unverdicted · none · ref 16 · internal anchor

Upper bounds are derived showing that neural oscillator approximation errors for causal operators and stable second-order dynamical systems scale polynomially with the reciprocals of the widths of the two MLPs.

STM3: Mixture of Multiscale Mamba for Long-Term Spatio-Temporal Time-Series Prediction

cs.LG · 2025-08-17 · unverdicted · none · ref 18 · 2 links · internal anchor

STM3 is a new multiscale Mamba mixture-of-experts model with graph causal networks and contrastive routing that reports state-of-the-art results on 10 long-term spatio-temporal forecasting benchmarks.

The Serial Scaling Hypothesis

cs.LG · 2025-07-16 · unverdicted · none · ref 36 · internal anchor

The serial scaling hypothesis formalizes inherently serial problems in complexity theory and demonstrates that diffusion models cannot solve them.

FADPNet: Frequency-Aware Dual-Path Network for Face Super-Resolution

cs.CV · 2025-06-17 · unverdicted · none · ref 28 · internal anchor

FADPNet decomposes facial features into low- and high-frequency components processed by dedicated Mamba and CNN modules to balance quality and efficiency in face super-resolution.

An Efficient Self-Supervised Framework for Long-Sequence EEG Modeling

cs.LG · 2025-02-25 · unverdicted · none · ref 10 · internal anchor

EEGM2 is a Mamba-2 integrated self-supervised model for EEG that claims linear complexity and state-of-the-art performance on long-sequence modeling and classification tasks.

Improving motor imagery decoding methods for an EEG-based mobile brain-computer interface in the context of the 2024 Cybathlon

cs.HC · 2025-11-28 · conditional · none · ref 37 · internal anchor

A modular EEG-based BCI with S4D deep learning classifier achieves 84% offline accuracy and enables real-time control for a tetraplegic user, with 73% success in post-competition validation.

Hybrid Architectures for Language Models: Systematic Analysis and Design Insights

cs.CL · 2025-10-06 · unverdicted · none · ref 19 · internal anchor

This work systematically compares inter-layer and intra-layer hybridization strategies for combining self-attention and Mamba-style state space models, evaluating them on language modeling, downstream tasks, long-context performance, scaling, and efficiency to derive optimal design recipes.

Advancing Intelligent Sequence Modeling: Evolution, Trade-offs, and Applications of State-Space Architectures from S4 to Mamba

cs.LG · 2025-03-22 · unverdicted · none · ref 2 · internal anchor

A survey tracing the evolution of state-space models like S4 and Mamba, their efficiency trade-offs, and applications in NLP, vision, and other domains.

Next-Latent Prediction Transformers Learn Compact World Models

cs.LG · 2025-11-08 · unreviewed · ref 12 · internal anchor

Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live

cs.OS · 2025-11-04 · unreviewed · ref 29 · internal anchor

Robust Filter Attention: Self-Attention as Precision-Weighted State Estimation

cs.LG · 2025-09-04 · unreviewed · ref 33 · internal anchor
