arxiv: 2111.00396 · v3 · submitted 2021-10-31 · 💻 cs.LG

Recognition: 3 theorem links

· Lean Theorem

Efficiently Modeling Long Sequences with Structured State Spaces

Albert Gu, Christopher R\'e, Karan Goel

Pith reviewed 2026-05-11 10:37 UTC · model grok-4.3

classification 💻 cs.LG

keywords structured state spacesstate space modelslong-range dependenciessequence modelingS4Long Range ArenaCauchy kernel

0 comments

The pith

S4 uses a low-rank correction to the state matrix A so state space models can be computed efficiently via Cauchy kernels while retaining long-range power.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks a single sequence model that works across modalities and scales to sequences of 10,000 or more steps. Prior state space models could theoretically capture arbitrary long dependencies but were too slow and memory-heavy for practical use. By conditioning the state matrix with a low-rank correction, S4 stabilizes diagonalization and reduces the core computation to a well-studied Cauchy kernel. This yields a model that matches or beats Transformers on image and language tasks, runs generation 60 times faster, and reaches state-of-the-art accuracy on every Long Range Arena task, including the previously unsolved 16k-length Path-X problem.

Core claim

Conditioning the state matrix A of the continuous-time SSM with a low-rank correction permits stable diagonalization; the resulting system can be evaluated exactly by computing a Cauchy kernel, preserving the theoretical long-range modeling capacity of the underlying SSM at far lower time and memory cost.

What carries the argument

Low-rank correction to the state matrix A, which enables stable diagonalization of the SSM and reduces its evaluation to Cauchy kernel computation.

If this is right

A single architecture can now address sequential CIFAR-10 at 91% accuracy without augmentation or auxiliary losses, on par with a larger 2-D ResNet.
S4 closes most of the gap to Transformers on image and language modeling while generating sequences 60 times faster.
Every Long Range Arena task, including the 16k-step Path-X problem that defeated all earlier methods, is solved at state-of-the-art accuracy with the same efficiency as competing models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the Cauchy-kernel reduction preserves the SSM's theoretical properties across domains, the same parameterization may be applied directly to other continuous-time dynamical systems that currently rely on attention or recurrence.
The efficiency gain could allow scaling the same model to sequences of hundreds of thousands of steps in domains such as genomics or long video without changing the core mathematics.
Because S4 avoids the quadratic cost of attention yet matches its long-range performance, it supplies a concrete alternative architecture whose scaling behavior can be compared directly against Transformer variants on the same long-context benchmarks.

Load-bearing premise

The low-rank correction to the state matrix A permits stable diagonalization and the resulting Cauchy kernel computation fully preserves the theoretical long-range modeling strengths of the underlying SSM without introducing approximation errors that degrade performance on real data.

What would settle it

An experiment showing that S4 either loses accuracy relative to the uncorrected SSM on a long-range task such as Path-X or requires asymptotically more than linear time for sequences of length 16k.

read the original abstract

A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on long-range dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of $10000$ or more steps. A promising recent approach proposed modeling sequences by simulating the fundamental state space model (SSM) $ x'(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t) $, and showed that for appropriate choices of the state matrix $ A $, this system could handle long-range dependencies mathematically and empirically. However, this method has prohibitive computation and memory requirements, rendering it infeasible as a general sequence modeling solution. We propose the Structured State Space sequence model (S4) based on a new parameterization for the SSM, and show that it can be computed much more efficiently than prior approaches while preserving their theoretical strengths. Our technique involves conditioning $ A $ with a low-rank correction, allowing it to be diagonalized stably and reducing the SSM to the well-studied computation of a Cauchy kernel. S4 achieves strong empirical results across a diverse range of established benchmarks, including (i) 91\% accuracy on sequential CIFAR-10 with no data augmentation or auxiliary losses, on par with a larger 2-D ResNet, (ii) substantially closing the gap to Transformers on image and language modeling tasks, while performing generation $60\times$ faster (iii) SoTA on every task from the Long Range Arena benchmark, including solving the challenging Path-X task of length 16k that all prior work fails on, while being as efficient as all competitors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes the Structured State Space (S4) sequence model, a new parameterization of the continuous-time state space model (SSM) x'(t) = A x(t) + B u(t), y(t) = C x(t) + D u(t). By applying a low-rank correction to the state matrix A, S4 enables stable diagonalization and reduces the SSM to an exact Cauchy kernel computation. This yields an efficient model that preserves the theoretical long-range dependency modeling strengths of SSMs. Empirically, S4 reports strong results including 91% accuracy on sequential CIFAR-10, closing the gap to Transformers on image/language tasks with 60x faster generation, and state-of-the-art performance on all Long Range Arena tasks including solving the 16k-length Path-X task where prior methods fail, while matching competitor efficiency.

Significance. If the results hold, this work is significant as it delivers a principled, scalable alternative to Transformers and other sequence models for long-range dependencies across modalities. It combines the mathematical advantages of SSMs with practical efficiency via the structured Cauchy kernel, addressing a key limitation of prior SSM approaches. Credit is due for consistent empirical validation on established benchmarks like LRA and CIFAR-10, and for the explicit structural parameterization that avoids hidden approximations in the kernel computation.

major comments (1)

[Abstract, §3] Abstract and §3 (parameterization): the central claim that the low-rank correction to A 'permits stable diagonalization' and 'reduces the SSM to the well-studied computation of a Cauchy kernel' while exactly preserving long-range modeling power lacks an ablation isolating the correction's contribution. Without this, it is difficult to confirm that performance on Path-X (length 16k) and other LRA tasks stems from the claimed structural properties rather than hyperparameter tuning or other implementation details.

minor comments (2)

[Experiments section] The manuscript would benefit from expanded details on training procedures, hyperparameter sensitivity, and full experimental setup (e.g., optimizer, learning rate schedules, and data preprocessing) to support reproducibility of the reported SoTA numbers.
[§3] Notation for the low-rank correction parameters and the resulting diagonalized form could be clarified with an explicit equation showing how the correction is applied to A before diagonalization.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation for minor revision. We address the major comment below.

read point-by-point responses

Referee: [Abstract, §3] Abstract and §3 (parameterization): the central claim that the low-rank correction to A 'permits stable diagonalization' and 'reduces the SSM to the well-studied computation of a Cauchy kernel' while exactly preserving long-range modeling power lacks an ablation isolating the correction's contribution. Without this, it is difficult to confirm that performance on Path-X (length 16k) and other LRA tasks stems from the claimed structural properties rather than hyperparameter tuning or other implementation details.

Authors: We thank the referee for this constructive comment. Section 3 derives that the low-rank correction to A is what permits stable diagonalization (avoiding the numerical instability of the HiPPO matrix) and reduces the SSM kernel computation exactly to a Cauchy matrix, thereby preserving the theoretical long-range modeling properties without approximation. While this is a mathematical property rather than an empirical one, we agree that an explicit ablation would strengthen the presentation. We will add such an ablation to the revised manuscript, comparing the full S4 model against an SSM variant without the low-rank correction on the LRA benchmark (including Path-X) to isolate its contribution. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper starts from the standard continuous-time SSM equations x'(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t) and introduces a new parameterization via low-rank correction to A. This choice is explicitly structural, enabling stable diagonalization and exact reduction to Cauchy kernel computation as a direct mathematical consequence rather than a redefinition or fit. Reported results consist of empirical performance on external benchmarks (sequential CIFAR-10, LRA tasks including Path-X of length 16k) that are not quantities predicted or fitted by construction from the model's own inputs. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to force the central claims; the efficiency and long-range modeling preservation follow from the closed-form kernel structure and are validated externally. The derivation remains self-contained against independent benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the SSM framework from prior work plus the new low-rank correction enabling efficient exact computation; no new physical entities are postulated.

free parameters (1)

low-rank correction parameters
The rank and values of the correction term added to A are part of the model and learned from data.

axioms (1)

domain assumption For appropriate choices of A the continuous SSM can capture long-range dependencies mathematically
Invoked in the abstract as the foundation that prior SSM work established but could not compute efficiently.

pith-pipeline@v0.9.0 · 5616 in / 1357 out tokens · 46526 ms · 2026-05-11T10:37:10.486355+00:00 · methodology

discussion (0)

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo
cond-mat.str-el 2026-05 conditional novelty 7.0

PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.
Selection, Not Fusion: Radar-Modulated State Space Models for Radar-Camera Depth Estimation
cs.CV 2026-05 unverdicted novelty 7.0

Radar-Modulated Selection perturbs only the step size Δ and readout C parameters inside Mamba's selective scan with radar data while keeping other components image-only, yielding state-of-the-art depth estimation on n...
TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles
cs.CV 2026-05 unverdicted novelty 7.0

TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.
TIDES: Implicit Time-Awareness in Selective State Space Models
cs.LG 2026-05 unverdicted novelty 7.0

TIDES reconciles selective SSM expressivity with continuous-time physical discretization by moving input dependence onto the state matrix, enabling native irregular time series handling and achieving SOTA on UEA and P...
PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
cs.LG 2026-05 unverdicted novelty 7.0

PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...
Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement
cs.CV 2026-05 unverdicted novelty 7.0

NOVA represents world states as INR weights for decoder-free rendering, compactness, and unsupervised disentanglement of background, foreground, and motion in video world models.
How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences
cs.LG 2026-05 unverdicted novelty 7.0

In linear recurrent models, infinite-width signal propagation remains accurate only for depths t much smaller than sqrt(width n), with a critical regime at t ~ c sqrt(n) where finite-width effects emerge and dominate ...
The Predictive-Causal Gap: An Impossibility Theorem and Large-Scale Neural Evidence
cs.LG 2026-05 unverdicted novelty 7.0

Predictive representation learning structurally favors encoding slower or less noisy environment modes over causal system modes, as shown by an impossibility theorem for linear-Gaussian dynamics and large-scale neural...
FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning
cs.LG 2026-05 unverdicted novelty 7.0

FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connect...
Rethink MAE with Linear Time-Invariant Dynamics
cs.CV 2026-04 unverdicted novelty 7.0

Token order in frozen visual representations is exploitable via SSM-based LTI probes, revealing pre-training-dependent heterogeneity that fixed pooling misses.
Mamba Sequence Modeling meets Model Predictive Control
math.OC 2026-04 unverdicted novelty 7.0

Mamba-MPC stabilizes and tracks references on SISO and MIMO systems in simulation and hardware while outperforming LSTM-MPC with faster computation.
RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation
cs.CV 2026-04 unverdicted novelty 7.0

RSGMamba introduces a reliability-aware self-gated Mamba block for dynamic cross-modal feature selection in semantic segmentation, delivering state-of-the-art mIoU on RGB-D and RGB-T benchmarks with 48.6M parameters.
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
cs.CV 2024-01 conditional novelty 7.0

Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.
A Single-Layer Model Can Do Language Modeling
cs.CL 2026-05 unverdicted novelty 6.0

A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).
Continuity Laws for Sequential Models
cs.LG 2026-05 unverdicted novelty 6.0

S4 models exhibit stable time-continuity unlike sensitive S6 models, with task continuity predicting performance and enabling temporal subsampling for better efficiency.
EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction
cs.CV 2026-05 unverdicted novelty 6.0

EmambaIR is a visual state space model with cross-modal top-k sparse attention and gated SSM components that outperforms prior CNN and ViT methods on event-guided deblurring, deraining, and HDR reconstruction while re...
StreamPhy: Streaming Inference of High-Dimensional Physical Dynamics via State Space Models
cs.LG 2026-05 unverdicted novelty 6.0

StreamPhy enables accurate streaming inference of full physical dynamics from irregular sparse data via an adaptive encoder, structured state-space model, and expressive FT-FiLM decoder, with claimed 48% accuracy gain...
Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators
cs.LG 2026-05 unverdicted novelty 6.0

Spectral Koopman operators let SSMs achieve 100% accuracy on long-gap multi-query associative recall with fixed memory, where pure Mamba fails.
Cubit: Token Mixer with Kernel Ridge Regression
cs.LG 2026-05 unverdicted novelty 6.0

Cubit replaces Transformer attention with Kernel Ridge Regression token mixing and shows potential gains on longer sequences.
Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement
cs.CV 2026-05 unverdicted novelty 6.0

NOVA represents scene states as INR weights for analytical rendering without decoders and achieves structural disentanglement of content and dynamics in video world models.
Training Transformers for KV Cache Compressibility
cs.LG 2026-05 unverdicted novelty 6.0

KV compressibility is a property of learned transformer representations that can be improved by training with KV sparsification, leading to better quality-budget tradeoffs in downstream compression for retrieval, QA, ...
Training Transformers for KV Cache Compressibility
cs.LG 2026-05 unverdicted novelty 6.0

Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.
Neural Co-state Policies: Structuring Hidden States in Recurrent Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Hidden states in recurrent RL policies correspond to PMP co-states, so a derived co-state loss structures the dynamics and yields robust performance on partially observable continuous control tasks.
ZAYA1-8B Technical Report
cs.AI 2026-05 unverdicted novelty 6.0

ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
The Impossibility Triangle of Long-Context Modeling
cs.CL 2026-05 unverdicted novelty 6.0

No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.
State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning
cs.LG 2026-04 unverdicted novelty 6.0

SST V2 introduces parallel-trainable nonlinear recurrence in latent space to let transformers reason continuously across positions, delivering +15 points on GPQA-Diamond and halving remaining GSM8K errors over matched...
Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
cs.CL 2026-04 unverdicted novelty 6.0

HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
FETS Benchmark: Foundation Models Outperform Dataset-specific Machine Learning in Energy Time Series Forecasting
cs.LG 2026-04 unverdicted novelty 6.0

Foundation models outperform dataset-specific machine learning in energy time series forecasting across 54 datasets in 9 categories.
An explicit operator explains end-to-end computation in the modern neural networks used for sequence and language modeling
cs.NE 2026-04 unverdicted novelty 6.0

S4D state space models correspond exactly to wave propagation and nonlinear wave interactions in a one-dimensional ring oscillator network, with a closed-form operator describing the complete input-output map.
Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention
cs.LG 2026-04 unverdicted novelty 6.0

Gist Sparse Attention uses learnable gist compression tokens as both summaries and routing signals, then selectively unfolds relevant raw chunks for fine-grained attention, outperforming compression and sparse-attenti...
Hero-Mamba: Mamba-based Dual Domain Learning for Underwater Image Enhancement
cs.CV 2026-04 unverdicted novelty 6.0

Hero-Mamba combines parallel spatial-spectral Mamba processing and a background-light-guided ColorFusion block to enhance underwater images, reporting PSNR 25.802 and SSIM 0.913 on the LSUI benchmark.
Event-Adaptive State Transition and Gated Fusion for RGB-Event Object Tracking
cs.CV 2026-04 unverdicted novelty 6.0

MambaTrack improves RGB-Event object tracking via event-adaptive state transitions in a Dynamic State Space Model and a Gated Projection Fusion module, reporting state-of-the-art results on FE108 and FELT datasets.
TCL: Enabling Fast and Efficient Cross-Hardware Tensor Program Optimization via Continual Learning
cs.LG 2026-04 conditional novelty 6.0

TCL delivers 16.8x faster tuning on CPU and 12.48x on GPU with modestly lower inference latency by combining RDU active sampling, a lightweight Mamba cost model, and cross-platform continual knowledge distillation.
RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction
cs.LG 2026-04 unverdicted novelty 6.0

RetentiveKV uses entropy to drive state-space model transitions that retain and reactivate low-attention visual tokens in a continuous memory instead of pruning them, delivering 5x KV cache compression and 1.5x faster...
Structured State-Space Regularization for Compact and Generation-Friendly Image Tokenization
cs.CV 2026-04 unverdicted novelty 6.0

A new regularizer transfers frequency awareness from state-space models into image tokenizers, yielding more compact latents that improve diffusion-model generation quality with little reconstruction penalty.
Membership Inference Attacks Expose Participation Privacy in ECG Foundation Encoders
cs.LG 2026-04 unverdicted novelty 6.0

Membership inference attacks can detect whether specific ECG data participated in pretraining self-supervised foundation encoders, with leakage strongest in small cohorts and contrastive models.
Tracking Listener Attention: Gaze-Guided Audio-Visual Speech Enhancement Framework
eess.AS 2026-04 unverdicted novelty 6.0

The GG-AVSE framework uses listener gaze direction combined with YOLO5Face and AVSEMamba to resolve target-speaker ambiguity in audio-visual speech enhancement, yielding gains in PESQ, STOI, and SI-SDR.
CloudMamba: An Uncertainty-Guided Dual-Scale Mamba Network for Cloud Detection in Remote Sensing Imagery
cs.CV 2026-04 unverdicted novelty 6.0

CloudMamba combines uncertainty-guided refinement with a dual-scale Mamba network to outperform prior methods on cloud segmentation accuracy while maintaining linear computational cost.
Physics-Aligned Spectral Mamba: Decoupling Semantics and Dynamics for Few-Shot Hyperspectral Target Detection
cs.CV 2026-04 unverdicted novelty 6.0

SpecMamba decouples stable semantic features from agile spectral adaptation via DCT-Mamba adapters, prior-guided tri-encoders, and self-supervised test-time mapping to improve few-shot hyperspectral target detection.
Kimi Linear: An Expressive, Efficient Attention Architecture
cs.CL 2025-10 unverdicted novelty 6.0

Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
Retentive Network: A Successor to Transformer for Large Language Models
cs.CL 2023-07 unverdicted novelty 6.0

RetNet is a new sequence modeling architecture that delivers parallel training, constant-time inference, and competitive language modeling performance as a potential replacement for Transformers.
Beyond Similarity: Temporal Operator Attention for Time Series Analysis
cs.LG 2026-05 unverdicted novelty 5.0

Temporal Operator Attention augments softmax attention with learnable sequence-space operators for signed temporal mixing and uses stochastic regularization to enable practical training, yielding consistent gains on t...
Kaczmarz Linear Attention
cs.LG 2026-05 unverdicted novelty 5.0

Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...
mHC-SSM: Manifold-Constrained Hyper-Connections for State Space Language Models with Stream-Specialized Adapters
cs.LG 2026-05 unverdicted novelty 5.0

Manifold-constrained multi-stream mixing plus per-stream adapters improves SSM language model validation loss from 6.3507 to 6.1353 and perplexity from 572.91 to 461.88 on WikiText-2.
StreamPhy: Streaming Inference of High-Dimensional Physical Dynamics via State Space Models
cs.LG 2026-05 unverdicted novelty 5.0

StreamPhy introduces an end-to-end streaming framework using state-space models and an expressive FT-FiLM decoder to infer continuous physical dynamics from irregular sparse data, claiming 48% better accuracy and 20-1...
Neural Co-state Policies: Structuring Hidden States in Recurrent Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 5.0

Recurrent RL policies can have their hidden states aligned with PMP co-states through a derived loss, yielding robust performance on partially observable control tasks.
SAMIC: A Lightweight Semantic-Aware Mamba for Efficient Perceptual Image Compression
cs.CV 2026-05 unverdicted novelty 5.0

SAMIC introduces semantic-aware Mamba blocks and SVD-based redundancy reduction to achieve efficient perceptual image compression with improved rate-distortion-perception tradeoffs.
Selective Attention-Based Network for Robust Infrared Small Target Detection
cs.CV 2026-04 unverdicted novelty 5.0

SANet augments U-Net with a Dual-path Semantic-aware Module using pinwheel convolutions and CBAM, plus a Selective Attention Fusion Module for adaptive cross-scale feature fusion, to improve detection of sub-pixel inf...
Looking Into the Past: Eye Movements Characterize Elements of Autobiographical Recall in Interviews with Holocaust Survivors
cs.MM 2026-04 unverdicted novelty 5.0

Eye movements during Holocaust survivor interviews vary by episodic, semantic, affective and temporal memory dimensions, with pre-onset gaze sufficient to predict sentence temporal context.
Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction
cs.MM 2026-04 unverdicted novelty 5.0

A new joint spatio-temporal enlargement model for micro-video popularity prediction using frame scoring for long sequences and a topology-aware memory bank for unbounded historical associations.
FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control
cs.LG 2026-04 unverdicted novelty 5.0

FG²-GDN replaces the scalar beta in the delta update with a channel-wise vector and decouples key/value scaling to improve recall over prior GDN and KDA models.
Sessa: Selective State Space Attention
cs.LG 2026-04 unverdicted novelty 5.0

Sessa integrates attention within recurrent paths to achieve power-law memory tails and flexible non-decaying selective retrieval, outperforming baselines on long-context tasks.
MedMamba: Recasting Mamba for Medical Time Series Classification
eess.SP 2026-04 unverdicted novelty 5.0

MedMamba introduces a principle-guided bidirectional multi-scale Mamba model that outperforms prior methods on EEG, ECG, and activity classification benchmarks while delivering 4.6x inference speedup.
A Mamba-Based Multimodal Network for Multiscale Blast-Induced Rapid Structural Damage Assessment
cs.AI 2026-04 unverdicted novelty 5.0

A new Mamba multimodal network integrates multi-scale blast-loading information with satellite images to improve rapid structural damage assessment after explosions, showing gains over prior methods on the Beirut 2020 case.
CARE-ECG: Causal Agent-based Reasoning for Explainable and Counterfactual ECG Interpretation
cs.LG 2026-04 unverdicted novelty 5.0

CARE-ECG unifies ECG representation learning, causal graph-based diagnosis, and counterfactual assessment in an agentic LLM pipeline to improve accuracy and explanation faithfulness.
HST-HGN: Heterogeneous Spatial-Temporal Hypergraph Networks with Bidirectional State Space Models for Global Fatigue Assessment
cs.CV 2026-04 unverdicted novelty 5.0

HST-HGN uses heterogeneous spatial-temporal hypergraph networks combined with bidirectional Mamba state space models to achieve state-of-the-art driver fatigue assessment from untrimmed videos while maintaining comput...
Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation
cs.CV 2026-04 unverdicted novelty 5.0

Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.
CUDA Kernel Optimization and Counter-Free Performance Analysis for Depthwise Convolution in Cloud Environments
cs.DC 2026-04 unverdicted novelty 4.0

Warp-tiled CUDA kernel for depthwise convolution delivers 3.26x runtime reduction versus naive baseline and 1.29x end-to-end training speedup using counter-free analysis in cloud settings.
ConvVitMamba: Efficient Multiscale Convolution, Transformer, and Mamba-Based Sequence modelling for Hyperspectral Image Classification
cs.CV 2026-04 unverdicted novelty 4.0

ConvVitMamba integrates multiscale convolution, transformer encoding, and Mamba-based refinement with PCA to outperform prior CNN, ViT, and Mamba methods in accuracy, size, and speed on four HSI benchmark datasets.
Attention Is not Everything: Efficient Alternatives for Vision
cs.CV 2026-04 unverdicted novelty 3.0

A survey that taxonomizes non-Transformer vision models and evaluates their practical trade-offs across efficiency, scalability, and robustness.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 56 Pith papers · 4 internal anchors

[1]

Unitary evolution recurrent neural networks

Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In The International Conference on Machine Learning (ICML) , pages 1120–1128, 2016

work page 2016
[2]

arXiv preprint arXiv:1809.10853 , year=

Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853, 2018

work page arXiv 2018
[3]

An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271 , 2018

work page internal anchor Pith review arXiv 2018
[4]

Trellis networks for sequence modeling

Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Trellis networks for sequence modeling. In The International Conference on Learning Representations (ICLR) , 2019

work page 2019
[5]

Dilated recurrent neural networks

Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael Witbrock, Mark Hasegawa-Johnson, and Thomas S Huang. Dilated recurrent neural networks. In Advances in Neural Information Processing Systems (NeurIPS) , 2017

work page 2017
[6]

Generating Long Sequences with Sparse Transformers

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509 , 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904
[7]

Parallelizing legendre memory unit training

Narsimha Chilkuri and Chris Eliasmith. Parallelizing legendre memory unit training. The International Conference on Machine Learning (ICML) , 2021

work page 2021
[8]

Rethinking attention with performers

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. In The International Conference on Learning Representations (ICLR) , 2020

work page 2020
[9]

Language modeling with gated convolutional networks

Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In International conference on machine learning , pages 933–941. PMLR, 2017

work page 2017
[10]

Gru-ode-bayes: Continuous modeling of sporadically-observed time series

Edward De Brouwer, Jaak Simm, Adam Arany, and Yves Moreau. Gru-ode-bayes: Continuous modeling of sporadically-observed time series. In Advances in Neural Information Processing Systems (NeurIPS) , 2019

work page 2019
[11]

Adversarial audio synthesis

Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. In ICLR, 2019

work page 2019
[12]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 , 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[13]

Lipschitz recurrent neural networks

N Benjamin Erichson, Omri Azencot, Alejandro Queiruga, Liam Hodgkinson, and Michael W Mahoney. Lipschitz recurrent neural networks. In International Conference on Learning Representations, 2021

work page 2021
[14]

It’s raw! audio generation with state-space models

Karan Goel, Albert Gu, Chris Donahue, and Christopher R´ e. It’s raw! audio generation with state-space models. arXiv preprint arXiv:2202.09729 , 2022

work page arXiv 2022
[15]

Matrix computations, volume 3

Gene H Golub and Charles F Van Loan. Matrix computations, volume 3. JHU press, 2013

work page 2013
[16]

Hippo: Recurrent memory with optimal polynomial projections

Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher R´ e. Hippo: Recurrent memory with optimal polynomial projections. In Advances in Neural Information Processing Systems (NeurIPS) , 2020

work page 2020
[17]

Improving the gating mechanism of recurrent neural networks

Albert Gu, Caglar Gulcehre, Tom Le Paine, Matt Hoﬀman, and Razvan Pascanu. Improving the gating mechanism of recurrent neural networks. In The International Conference on Machine Learning (ICML) , 2020. 13

work page 2020
[18]

Combining recurrent, convolutional, and continuous-time models with the structured learnable linear state space layer

Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher R´ e. Combining recurrent, convolutional, and continuous-time models with the structured learnable linear state space layer. In Advances in Neural Information Processing Systems (NeurIPS) , 2021

work page 2021
[19]

On the parameterization and initialization of diagonal state space models

Albert Gu, Ankit Gupta, Karan Goel, and Christopher R´ e. On the parameterization and initialization of diagonal state space models. arXiv preprint arXiv:2206.11893 , 2022

work page arXiv 2022
[20]

arXiv preprint arXiv:2206.12037 , title =

Albert Gu, Isys Johnson, Aman Timalsina, Atri Rudra, and Christopher R´ e. How to train your hippo: State space models with generalized basis projections. arXiv preprint arXiv:2206.12037 , 2022

work page arXiv 2022
[21]

Long short-term memory.Neural computation, 9(8):1735–1780, 1997

Sepp Hochreiter and J¨ urgen Schmidhuber. Long short-term memory.Neural computation, 9(8):1735–1780, 1997

work page 1997
[22]

Transformers are rnns: Fast autoregressive transformers with linear attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and Fran¸ cois Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020

work page 2020
[23]

Neural controlled diﬀerential equations for irregular time series

Patrick Kidger, James Morrill, James Foster, and Terry Lyons. Neural controlled diﬀerential equations for irregular time series. arXiv preprint arXiv:2005.08926 , 2020

work page arXiv 2005
[24]

Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group

Mario Lezcano-Casado and David Mart´ ınez-Rubio. Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. In The International Conference on Machine Learning (ICML), 2019

work page 2019
[25]

Independently recurrent neural network (IndRNN): Building a longer and deeper RNN

Shuai Li, Wanqing Li, Chris Cook, Ce Zhu, and Yanbo Gao. Independently recurrent neural network (IndRNN): Building a longer and deeper RNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5457–5466, 2018

work page 2018
[26]

Time-aware large kernel convolutions

Vasileios Lioutas and Yuhong Guo. Time-aware large kernel convolutions. In International Conference on Machine Learning, pages 6172–6183. PMLR, 2020

work page 2020
[27]

Scalable language modeling: Wikitext-103 on a single gpu in 12 hours

Stephen Merity, Nitish Shirish Keskar, James Bradbury, and Richard Socher. Scalable language modeling: Wikitext-103 on a single gpu in 12 hours. SysML, 2018

work page 2018
[28]

WaveNet: A Generative Model for Raw Audio

Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 , 2016

work page internal anchor Pith review arXiv 2016
[29]

Structured matrices and polynomials: uniﬁed superfast algorithms

Victor Pan. Structured matrices and polynomials: uniﬁed superfast algorithms . Springer Science & Business Media, 2001

work page 2001
[30]

Fast approximate computations with cauchy matrices and polynomials

Victor Pan. Fast approximate computations with cauchy matrices and polynomials. Mathematics of Computation, 86(308):2799–2826, 2017

work page 2017
[31]

Transformations of matrix structures work again

Victor Y Pan. Transformations of matrix structures work again. Linear Algebra and Its Applications , 465:107–138, 2015

work page 2015
[32]

On the diﬃculty of training recurrent neural networks

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the diﬃculty of training recurrent neural networks. In International conference on machine learning , pages 1310–1318, 2013

work page 2013
[33]

Fast parametric learning with activation memorization

Jack Rae, Chris Dyer, Peter Dayan, and Timothy Lillicrap. Fast parametric learning with activation memorization. The International Conference on Machine Learning (ICML) , 2018

work page 2018
[34]

Fast generation for convolutional autoregressive models

Prajit Ramachandran, Tom Le Paine, Pooya Khorrami, Mohammad Babaeizadeh, Shiyu Chang, Yang Zhang, Mark A Hasegawa-Johnson, Roy H Campbell, and Thomas S Huang. Fast generation for convolutional autoregressive models. arXiv preprint arXiv:1704.06001 , 2017

work page arXiv 2017
[35]

Ckconv: Continuous kernel convolution for sequential data

David W Romero, Anna Kuzina, Erik J Bekkers, Jakub M Tomczak, and Mark Hoogendoorn. Ckconv: Continuous kernel convolution for sequential data. arXiv preprint arXiv:2102.02611 , 2021. 14

work page arXiv 2021
[36]

Flexconv: Continuous kernel convolutions with diﬀerentiable kernel sizes

David W Romero, Robert-Jan Bruintjes, Jakub M Tomczak, Erik J Bekkers, Mark Hoogendoorn, and Jan C van Gemert. Flexconv: Continuous kernel convolutions with diﬀerentiable kernel sizes. In The International Conference on Learning Representations (ICLR) , 2022

work page 2022
[37]

Latent ordinary diﬀerential equations for irregularly-sampled time series

Yulia Rubanova, Tian Qi Chen, and David K Duvenaud. Latent ordinary diﬀerential equations for irregularly-sampled time series. In Advances in Neural Information Processing Systems, pages 5321–5331, 2019

work page 2019
[38]

Unicornn: A recurrent model for learning very long time dependencies

T Konstantin Rusch and Siddhartha Mishra. Unicornn: A recurrent model for learning very long time dependencies. The International Conference on Machine Learning (ICML) , 2021

work page 2021
[39]

PixelCNN++: Improving the pixelcnn with dis- cretized logistic mixture likelihood and other modiﬁcations

Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modiﬁcations. arXiv preprint arXiv:1701.05517 , 2017

work page arXiv 2017
[40]

Long range arena : A benchmark for eﬃcient transformers

Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long range arena : A benchmark for eﬃcient transformers. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum? id=qVyeW-grC2k

work page 2021
[41]

CoRR, abs/2105.01601

Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, et al. Mlp-mixer: An all-mlp architecture for vision. arXiv preprint arXiv:2105.01601 , 2021

work page arXiv 2021
[42]

Learning longer-term dependencies in RNNs with auxiliary losses

Trieu H Trinh, Andrew M Dai, Minh-Thang Luong, and Quoc V Le. Learning longer-term dependencies in RNNs with auxiliary losses. In The International Conference on Machine Learning (ICML) , 2018

work page 2018
[43]

A method of analysing the behaviour of linear systems in terms of time series

Arnold Tustin. A method of analysing the behaviour of linear systems in terms of time series. Journal of the Institution of Electrical Engineers-Part IIA: Automatic Regulators and Servo Mechanisms , 94(1): 130–142, 1947

work page 1947
[44]

Gomez, Lukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017

work page 2017
[45]

Legendre memory units: Continuous-time representation in recurrent neural networks

Aaron Voelker, Ivana Kaji´ c, and Chris Eliasmith. Legendre memory units: Continuous-time representation in recurrent neural networks. In Advances in Neural Information Processing Systems, pages 15544–15553, 2019

work page 2019
[46]

Dynamical systems in spiking neuromorphic hardware

Aaron Russell Voelker. Dynamical systems in spiking neuromorphic hardware . PhD thesis, University of Waterloo, 2019

work page 2019
[47]

Speech commands: A dataset for limited-vocabulary speech recognition.arXiv preprint arXiv:1804.03209, 2018

Pete Warden. Speech commands: A dataset for limited-vocabulary speech recognition. ArXiv, abs/1804.03209, 2018

work page arXiv 2018
[48]

Inverting modiﬁed matrices

Max A Woodbury. Inverting modiﬁed matrices. Memorandum report, 42:106, 1950

work page 1950
[49]

Pay less attention with lightweight and dynamic convolutions

Felix Wu, Angela Fan, Alexei Baevski, Yann N Dauphin, and Michael Auli. Pay less attention with lightweight and dynamic convolutions. In The International Conference on Learning Representations (ICLR), 2019

work page 2019
[50]

Informer: Beyond eﬃcient transformer for long sequence time-series forecasting

Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond eﬃcient transformer for long sequence time-series forecasting. In The Thirty-Fifth AAAI Conference on Artiﬁcial Intelligence, AAAI 2021, Virtual Conference , volume 35, pages 11106– 11115. AAAI Press, 2021. 15 A Discussion Related Work. Our work...

work page 2021
[51]

Gu et al

derived a non-trainable SSM motivated from approximating a neuromorphic spiking model, and Chilkuri and Eliasmith [7] showed that it could be sped up at train time with a convolutional view. Gu et al. [16] extended this special case to a general continuous-time function approximation framework with several more special cases of A matrices designed for lon...

work page
[52]

Our S4 model uses the same Transformer backbone as in [ 2]

and many more. Our S4 model uses the same Transformer backbone as in [ 2]. The model consists of 16 blocks of S4 layers alternated with position-wise feedforward layers, with a feature dimension of 1024. Because our S4 layer has around 1/4 the number of parameters as a self-attention layer with the same dimension, we made two modiﬁcations to match the par...

work page 2040