Efficiently Modeling Long Sequences with Structured State Spaces

Albert Gu, Karan Goel, Christopher Ré · 2021 · cs.LG · arXiv 2111.00396

Canonical reference. 77% of citing Pith papers cite this work as background.

131 Pith papers citing it

Background 77% of classified citations

abstract

A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on long-range dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of $10000$ or more steps. A promising recent approach proposed modeling sequences by simulating the fundamental state space model (SSM) $x'(t) = Ax(t) + Bu(t)$, $y(t) = Cx(t) + Du(t)$, and showed that for appropriate choices of the state matrix $A$, this system could handle long-range dependencies mathematically and empirically. However, this method has prohibitive computation and memory requirements, rendering it infeasible as a general sequence modeling solution. We propose the Structured State Space sequence model (S4) based on a new parameterization for the SSM, and show that it can be computed much more efficiently than prior approaches while preserving their theoretical strengths. Our technique involves conditioning $A$ with a low-rank correction, allowing it to be diagonalized stably and reducing the SSM to the well-studied computation of a Cauchy kernel. S4 achieves strong empirical results across a diverse range of established benchmarks, including (i) 91% accuracy on sequential CIFAR-10 with no data augmentation or auxiliary losses, on par with a larger 2-D ResNet, (ii) substantially closing the gap to Transformers on image and language modeling tasks, while performing generation $60\times$ faster, and (iii) SoTA on every task from the Long Range Arena benchmark, including solving the challenging Path-X task of length 16k that all prior work fails on, while being as efficient as all competitors.
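The state space model in the abstract can be illustrated with a small sketch. It is not the S4 algorithm: it assumes a diagonal $A$ with hypothetical random negative entries (not the HiPPO matrix with a low-rank correction), sets $D = 0$, uses a bilinear discretization with a hypothetical step size, and builds the convolution kernel naively rather than through a Cauchy kernel. All function and variable names are made up for the example. It only shows that the discretized SSM can be run either as a recurrence or as a causal convolution, and that the two views agree.

```python
import random


def discretize_diagonal(a, b, step):
    """Bilinear discretization of x'(t) = A x(t) + B u(t) for a diagonal A.

    `a` holds the diagonal of A and `b` the entries of B, so every
    operation is elementwise.
    """
    a_bar = [(1 + step / 2 * ai) / (1 - step / 2 * ai) for ai in a]
    b_bar = [step * bi / (1 - step / 2 * ai) for ai, bi in zip(a, b)]
    return a_bar, b_bar


def run_recurrence(a_bar, b_bar, c, u):
    """Recurrent view: x_k = A_bar x_{k-1} + B_bar u_k, y_k = C x_k."""
    x = [0.0] * len(a_bar)
    ys = []
    for u_k in u:
        x = [ab * xi + bb * u_k for ab, bb, xi in zip(a_bar, b_bar, x)]
        ys.append(sum(ci * xi for ci, xi in zip(c, x)))
    return ys


def ssm_kernel(a_bar, b_bar, c, length):
    """Naive kernel K_j = sum_i c_i * a_bar_i**j * b_bar_i for j < length."""
    return [
        sum(ci * ab ** j * bb for ci, ab, bb in zip(c, a_bar, b_bar))
        for j in range(length)
    ]


def run_convolution(kernel, u):
    """Convolutional view: y_k = sum_{j<=k} K_j u_{k-j} (causal)."""
    return [
        sum(kernel[j] * u[k - j] for j in range(k + 1))
        for k in range(len(u))
    ]


# Hypothetical toy sizes and parameters, chosen only for illustration.
rng = random.Random(0)
state_size, seq_len = 4, 64
a = [rng.uniform(-2.0, -0.5) for _ in range(state_size)]  # stable: negative
b = [rng.gauss(0.0, 1.0) for _ in range(state_size)]
c = [rng.gauss(0.0, 1.0) for _ in range(state_size)]
u = [rng.gauss(0.0, 1.0) for _ in range(seq_len)]

a_bar, b_bar = discretize_diagonal(a, b, step=0.1)
y_rec = run_recurrence(a_bar, b_bar, c, u)
y_conv = run_convolution(ssm_kernel(a_bar, b_bar, c, seq_len), u)

# The two views should agree up to floating-point rounding.
max_gap = max(abs(r - v) for r, v in zip(y_rec, y_conv))
```

The recurrent form suits step-by-step generation, while the convolutional form processes a whole sequence at once. The naive kernel loop above costs far more than the efficient computation the abstract describes and is shown only to make the equivalence concrete.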

citation-role summary

background 20 · method 5 · baseline 1

citation-polarity summary

background 20 · use method 5 · baseline 1

authors

Albert Gu, Christopher Ré, Karan Goel

representative citing papers

Rotation Equivariant Mamba for Vision Tasks

cs.CV · 2026-03-10 · unverdicted · novelty 8.0

EQ-VMamba adds rotation-equivariant cross-scan and group Mamba blocks to enforce end-to-end rotation equivariance, yielding better rotation robustness, competitive accuracy, and roughly 50% fewer parameters than non-equivariant baselines across classification, segmentation, and super-resolution.

Test-Time Training with KV Binding Is Secretly Linear Attention

cs.LG · 2026-02-24 · conditional · novelty 8.0

Test-time training with KV binding reduces to learned linear attention.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

MASS: Motion-Aligned Selective Scan for Refinement in Flow-Based Video Frame Interpolation

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

MASS reformulates SSM-based feature scanning in flow-based VFI to follow dynamic motion trajectories via learnable path integration and velocity-aware sampling, claiming SOTA on challenging large-displacement cases.

MOSAIC: A Workload-Driven Simulation and Design-Space Exploration Framework for Heterogeneous NPUs

cs.AR · 2026-06-03 · unverdicted · novelty 7.0

MOSAIC is a simulation and DSE framework for heterogeneous NPUs that finds designs achieving 46.91% mean iso-area energy savings over homogeneous baselines on 20 workloads.

Learning Long Range Spatio-Temporal Representations over Continuous Time Dynamic Graphs with State Space Models

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

CTDG-SSM introduces CTT-HiPPO, a Laplacian-polynomial projection of HiPPO, to create a parameter-efficient state-space formulation for continuous-time dynamic graphs that captures long-range spatio-temporal patterns.

AURA: Action-Gated Memory for Robot Policies at Constant VRAM

cs.AI · 2026-06-01 · unverdicted · novelty 7.0

AURA-Mem uses an action-gated recurrent memory trained on closed-loop action error to deliver constant 4,224-byte state and 5-9x fewer writes than baselines while matching base policy success on LIBERO-Long.

Trading Complexity for Expressivity Through Structured Generalized Linear Token Mixing

cs.LG · 2026-05-29 · unverdicted · novelty 7.0

Presents a structured generalized linear token mixing framework that extends recurrence equations to multiple past states, enabling new patterns with provable complexity-expressivity trade-offs for causal generation.

UWM-JEPA: Predictive World Models That Imagine in Belief Space

cs.LG · 2026-05-25 · unverdicted · novelty 7.0

UWM-JEPA uses a density-matrix latent and unitary predictor in JEPA to preserve joint-state spectrum during blind rollouts, achieving 0.77 accuracy on a five-step hidden-velocity task versus 0.53 for an LSTM baseline.

Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation

cs.CV · 2026-05-24 · unverdicted · novelty 7.0

MVCHead uses a hierarchical state space model with bi-directional scans and an SE(3) critic to enforce 3D consistency in Gaussian avatars trained only on 2D images.

Exact expression for maximum Lyapunov exponent during transients in computationally powerful dynamical networks

nlin.CD · 2026-05-20 · unverdicted · novelty 7.0

Exact analytical expression for the time-dependent maximum Lyapunov exponent during transients in a network supporting dynamics-based computation.

Social-Mamba: Socially-Aware Trajectory Forecasting with State-Space Models

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

Social-Mamba introduces a Cycle Mamba block and social triplet factorization to achieve state-of-the-art trajectory forecasting accuracy with linear-time social interaction modeling on five benchmarks.

A Novel Schur-Decomposition-Based Weight Projection Method for Stable State-Space Neural-Network Architectures

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

A real Schur decomposition projection maps the state matrix of discrete-time state-space layers onto its nearest stable counterpart, delivering accuracy comparable to prior stable identification methods with fewer weights.

QLAM: A Quantum Long-Attention Memory Approach to Long-Sequence Token Modeling

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

QLAM extends state-space models with quantum superposition in the hidden state for linear-time long-sequence modeling and reports consistent gains over RNN and transformer baselines on sequential image tasks.

Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo

cond-mat.str-el · 2026-05-13 · conditional · novelty 7.0

PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.

Selection, Not Fusion: Radar-Modulated State Space Models for Radar-Camera Depth Estimation

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

Radar-Modulated Selection perturbs only the step size Δ and readout C parameters inside Mamba's selective scan with radar data while keeping other components image-only, yielding state-of-the-art depth estimation on nuScenes with up to 34% MAE reduction.

TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.

TIDES: Implicit Time-Awareness in Selective State Space Models

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

TIDES reconciles selective SSM expressivity with continuous-time physical discretization by moving input dependence onto the state matrix, enabling native irregular time series handling and achieving SOTA on UEA and Physiome-ODE benchmarks.

PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

cs.LG · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

PairAlign learns compact variable-length token sequences for audio via self-alignment on paired content-preserving views, achieving 55% fewer archive tokens than VQ while preserving edit-distance retrieval at 12.71 tokens/s.

Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement

cs.CV · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

NOVA represents world states as INR weights for decoder-free rendering, compactness, and unsupervised disentanglement of background, foreground, and motion in video world models.

How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

In linear recurrent models, infinite-width signal propagation remains accurate only for depths t much smaller than sqrt(width n), with a critical regime at t ~ c sqrt(n) where finite-width effects emerge and dominate for larger t.

The Predictive-Causal Gap: An Impossibility Theorem and Large-Scale Neural Evidence

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

Predictive representation learning structurally favors encoding slower or less noisy environment modes over causal system modes, as shown by an impossibility theorem for linear-Gaussian dynamics and large-scale neural experiments.

FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connections for better information flow.

Rethink MAE with Linear Time-Invariant Dynamics

cs.CV · 2026-04-29 · unverdicted · novelty 7.0

Token order in frozen visual representations is exploitable via SSM-based LTI probes, revealing pre-training-dependent heterogeneity that fixed pooling misses.

citing papers explorer

Showing 31 of 131 citing papers.

Sessa: Selective State Space Attention

cs.LG · 2026-04-20 · unverdicted · none · ref 9 · internal anchor

Sessa integrates attention within recurrent paths to achieve power-law memory tails and flexible non-decaying selective retrieval, outperforming baselines on long-context tasks.

MedMamba: Recasting Mamba for Medical Time Series Classification

eess.SP · 2026-04-17 · unverdicted · none · ref 14 · internal anchor

MedMamba introduces a principle-guided bidirectional multi-scale Mamba model that outperforms prior methods on EEG, ECG, and activity classification benchmarks while delivering 4.6x inference speedup.

A Mamba-Based Multimodal Network for Multiscale Blast-Induced Rapid Structural Damage Assessment

cs.AI · 2026-04-13 · unverdicted · none · ref 11 · internal anchor

A new Mamba multimodal network integrates multi-scale blast-loading information with satellite images to improve rapid structural damage assessment after explosions, showing gains over prior methods on the Beirut 2020 case.

Structured State-Space Regularization for Generation-Friendly Image Tokenization

cs.CV · 2026-04-13 · unverdicted · none · ref 20 · 2 links · internal anchor

Structured state-space regularization induces spectral structure in image tokenizer latent spaces via an SSM-derived objective, improving generative performance with minimal reconstruction loss.

CARE-ECG: Causal Agent-based Reasoning for Explainable and Counterfactual ECG Interpretation

cs.LG · 2026-04-12 · unverdicted · none · ref 10 · internal anchor

CARE-ECG unifies ECG representation learning, causal graph-based diagnosis, and counterfactual assessment in an agentic LLM pipeline to improve accuracy and explanation faithfulness.

HST-HGN: Heterogeneous Spatial-Temporal Hypergraph Networks with Bidirectional State Space Models for Global Fatigue Assessment

cs.CV · 2026-04-09 · unverdicted · none · ref 16 · internal anchor

HST-HGN uses heterogeneous spatial-temporal hypergraph networks combined with bidirectional Mamba state space models to achieve state-of-the-art driver fatigue assessment from untrimmed videos while maintaining computational efficiency for real-time use.

Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

cs.CV · 2026-04-06 · unverdicted · none · ref 16 · internal anchor

Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.

Upper Approximation Bounds for Neural Oscillators

cs.LG · 2025-11-30 · unverdicted · none · ref 16 · internal anchor

Upper bounds are derived showing that neural oscillator approximation errors for causal operators and stable second-order dynamical systems scale polynomially with the reciprocals of the widths of the two MLPs.

STM3: Mixture of Multiscale Mamba for Long-Term Spatio-Temporal Time-Series Prediction

cs.LG · 2025-08-17 · unverdicted · none · ref 18 · 2 links · internal anchor

STM3 is a new multiscale Mamba mixture-of-experts model with graph causal networks and contrastive routing that reports state-of-the-art results on 10 long-term spatio-temporal forecasting benchmarks.

The Serial Scaling Hypothesis

cs.LG · 2025-07-16 · unverdicted · none · ref 36 · internal anchor

The serial scaling hypothesis formalizes inherently serial problems in complexity theory and demonstrates that diffusion models cannot solve them.

FADPNet: Frequency-Aware Dual-Path Network for Face Super-Resolution

cs.CV · 2025-06-17 · unverdicted · none · ref 28 · internal anchor

FADPNet decomposes facial features into low- and high-frequency components processed by dedicated Mamba and CNN modules to balance quality and efficiency in face super-resolution.

An Efficient Self-Supervised Framework for Long-Sequence EEG Modeling

cs.LG · 2025-02-25 · unverdicted · none · ref 10 · internal anchor

EEGM2 is a Mamba-2 integrated self-supervised model for EEG that claims linear complexity and state-of-the-art performance on long-sequence modeling and classification tasks.

EventCrab: Harnessing Frame and Point Synergy for Event-based Action Recognition and Beyond

cs.CV · 2024-11-27 · unverdicted · none · ref 12 · internal anchor

EventCrab integrates frame and point networks with a joint representation space, SCL, and Hilbert-scan EPE to improve event-based action recognition by 5-7% on two datasets.

3DMambaComplete: Exploring Structured State Space Model for Point Cloud Completion

cs.CV · 2024-04-10 · unverdicted · none · ref 11 · internal anchor

3DMambaComplete applies the Mamba model to point cloud completion via hyperpoint generation, spatial spreading, and mesh deformation, claiming better results than prior methods on benchmarks.

ZONOS2 Technical Report

cs.SD · 2026-06-23 · unverdicted · none · ref 76 · internal anchor

ZONOS2 8B is a scaled MoE TTS model with 900M active parameters trained on 6M hours of data that reports competitive SOTA results on naturalness, speaker similarity, WER, and a new ZTTS1-Eval benchmark while releasing weights and code.

CUDA Kernel Optimization and Counter-Free Performance Analysis for Depthwise Convolution in Cloud Environments

cs.DC · 2026-04-28 · unverdicted · none · ref 23 · internal anchor

Warp-tiled CUDA kernel for depthwise convolution delivers 3.26x runtime reduction versus naive baseline and 1.29x end-to-end training speedup using counter-free analysis in cloud settings.

ConvVitMamba: Efficient Multiscale Convolution, Transformer, and Mamba-Based Sequence modelling for Hyperspectral Image Classification

cs.CV · 2026-04-20 · unverdicted · none · ref 12 · internal anchor

ConvVitMamba integrates multiscale convolution, transformer encoding, and Mamba-based refinement with PCA to outperform prior CNN, ViT, and Mamba methods in accuracy, size, and speed on four HSI benchmark datasets.

Deep Learning for Virtual Reality User Identification: A Benchmark

cs.HC · 2026-03-14 · unverdicted · none · ref 14 · internal anchor

A benchmark study evaluates standard and emerging deep learning architectures on motion data from 71 VR users, establishing performance baselines for user identification.

Improving motor imagery decoding methods for an EEG-based mobile brain-computer interface in the context of the 2024 Cybathlon

cs.HC · 2025-11-28 · conditional · none · ref 37 · internal anchor

A modular EEG-based BCI with S4D deep learning classifier achieves 84% offline accuracy and enables real-time control for a tetraplegic user, with 73% success in post-competition validation.

Hybrid Architectures for Language Models: Systematic Analysis and Design Insights

cs.CL · 2025-10-06 · unverdicted · none · ref 19 · internal anchor

This work systematically compares inter-layer and intra-layer hybridization strategies for combining self-attention and Mamba-style state space models, evaluating them on language modeling, downstream tasks, long-context performance, scaling, and efficiency to derive optimal design recipes.

SmileyLlama: Modifying Large Language Models for Directed Chemical Space Exploration

physics.chem-ph · 2024-09-03 · unverdicted · none · ref 17 · internal anchor

SmileyLlama is an LLM transformed via SFT and DPO to generate valid novel drug-like molecules with user-specified properties and optimized 3D conformations for high binding affinity.

Attention Is not Everything: Efficient Alternatives for Vision

cs.CV · 2026-04-19 · unverdicted · none · ref 36 · internal anchor

A survey that taxonomizes non-Transformer vision models and evaluates their practical trade-offs across efficiency, scalability, and robustness.

When control meets large language models: From words to dynamics

eess.SY · 2026-02-03 · unverdicted · none · ref 240 · internal anchor

The paper proposes a bidirectional continuum between LLMs and control systems, covering LLM-assisted controller design, control-based LLM steering, and state-space modeling of LLMs.

A Survey on Efficient Inference for Large Language Models

cs.CL · 2024-04-22 · accept · none · ref 66 · internal anchor

The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

A Survey of Mamba

cs.LG · 2024-08-02 · unverdicted · none · ref 59 · internal anchor

The paper consolidates existing research on Mamba models, their architecture variants, adaptations to different data modalities, and applications across domains.

Advancing Intelligent Sequence Modeling: Evolution, Trade-offs, and Applications of State-Space Architectures from S4 to Mamba

cs.LG · 2025-03-22 · unverdicted · none · ref 2 · internal anchor

A survey tracing the evolution of state-space models like S4 and Mamba, their efficiency trade-offs, and applications in NLP, vision, and other domains.

Topological Neural Dynamics: A Neuron-wise Framework for Sequence Modeling

cs.LG · 2026-06-19 · unreviewed · ref 29 · internal anchor

Simplified Sparse Attention via Gist Tokens

cs.LG · 2026-04-22 · unreviewed · ref 18 · internal anchor

Next-Latent Prediction Transformers Learn Compact World Models

cs.LG · 2025-11-08 · unreviewed · ref 12 · internal anchor

Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live

cs.OS · 2025-11-04 · unreviewed · ref 29 · internal anchor

Robust Filter Attention: Self-Attention as Precision-Weighted State Estimation

cs.LG · 2025-09-04 · unreviewed · ref 33 · internal anchor
