Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Pith reviewed 2026-05-10 11:48 UTC · model grok-4.3
The pith
Selective SSMs let Mamba model sequences in linear time while matching larger Transformers on language tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By allowing the state transition parameters of a state space model to depend on the current input, the model gains the ability to selectively propagate or forget information along the sequence. When these selective SSMs are stacked into a simplified end-to-end network without attention or MLP blocks, the architecture achieves linear scaling in sequence length, five times higher inference throughput than Transformers, and state-of-the-art performance across modalities. On language modeling a 3B-parameter Mamba model outperforms Transformers of the same size and matches Transformers twice its size in both pretraining and downstream evaluation.
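To make the mechanism concrete, here is a minimal sketch of the selective recurrence, written in JAX with sequential reference semantics rather than the paper's fused hardware-aware kernel. The shapes, the projection names (W_dt, W_B, W_C), and the softplus-then-exponential discretization follow the paper's general description, but the exact parameterization here is an illustrative assumption.

```python
import jax
import jax.numpy as jnp

def selective_ssm(x, A, W_dt, W_B, W_C):
    """Sequential reference semantics for one selective SSM layer (a sketch).

    x: (L, d) input sequence; A: (d, n) fixed negative-real state parameters.
    W_dt: (d, d) and W_B, W_C: (d, n) are projections that make the step
    size and the B/C parameters functions of the current input.
    """
    d, n = A.shape

    def step(h, x_t):                          # h: (d, n) per-channel state
        dt  = jax.nn.softplus(x_t @ W_dt)      # (d,) input-dependent step size
        B_t = x_t @ W_B                        # (n,) input-dependent input map
        C_t = x_t @ W_C                        # (n,) input-dependent readout
        A_bar = jnp.exp(dt[:, None] * A)       # (d, n) discretized transition
        h = A_bar * h + (dt[:, None] * B_t[None, :]) * x_t[:, None]
        y_t = h @ C_t                          # (d,) output for this step
        return h, y_t

    _, y = jax.lax.scan(step, jnp.zeros((d, n)), x)
    return y                                   # (L, d)
```

Freezing dt, B_t, and C_t to learned constants recovers a time-invariant SSM; the input-dependence of those three quantities is the entire "selection" mechanism the claim rests on.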
What carries the argument
Selective SSMs, in which the state transition and output parameters are computed from the input at each step to enable content-dependent propagation or forgetting of information.
If this is right
- Performance on real data improves as sequence length grows to a million tokens.
- A 3B Mamba model outperforms same-size Transformers and matches twice-as-large Transformers on language pretraining and downstream tasks.
- Inference throughput reaches five times that of comparable Transformers while maintaining linear scaling.
- State-of-the-art results appear on language, audio, and genomics without attention or MLP blocks.
Where Pith is reading between the lines
- The same input-dependent selectivity pattern could be added to other linear recurrent architectures to improve their handling of long-range dependencies.
- Hardware-aware parallel scans for selective recurrence may become a standard optimization for any model that trades attention for linear time; a minimal sketch of such a scan follows this list.
- If the pattern generalizes, smaller models built this way could replace larger attention-based models in applications that need long context windows.
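On the second point, the core trick is small enough to sketch: an input-dependent linear recurrence h_t = a_t * h_{t-1} + b_t composes associatively, so all prefix states can be computed in logarithmic depth. The sketch below uses jax.lax.associative_scan as a stand-in; the paper's actual kernel additionally manages GPU memory movement, which is not reproduced here.

```python
import jax
import jax.numpy as jnp

def parallel_linear_recurrence(a, b):
    """All prefix states h_t of h_t = a_t * h_{t-1} + b_t with h_0 = 0,
    computed via a parallel scan instead of a length-L sequential loop."""
    def combine(e1, e2):
        a1, b1 = e1
        a2, b2 = e2
        # Composing two affine maps h -> a*h + b stays affine (associative).
        return a2 * a1, a2 * b1 + b2
    _, h = jax.lax.associative_scan(combine, (a, b))
    return h

a = jnp.full((8,), 0.9)   # decay; input-dependent in a selective SSM
b = jnp.ones((8,))        # driven input
print(parallel_linear_recurrence(a, b))
```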
Load-bearing premise
Making SSM parameters depend on the input is enough to overcome the content-based reasoning weakness of earlier subquadratic models, and the resulting selective SSMs can be trained stably at scale in a simplified architecture without attention or MLP blocks.
What would settle it
Training Mamba models on long language sequences and observing that they underperform same-size Transformers on standard benchmarks, or that wall-clock inference time grows faster than linearly with sequence length, would falsify the central claims.
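The wall-clock half of this test is cheap to approximate. A toy version, under the assumption that a single scan is a fair proxy for the model's inference step: time it at doubling sequence lengths and check that the cost grows roughly linearly.

```python
import time
import jax
import jax.numpy as jnp

def last_state(a, b):
    # Final state of h_t = a_t * h_{t-1} + b_t via a parallel scan.
    def combine(e1, e2):
        (a1, b1), (a2, b2) = e1, e2
        return a2 * a1, a2 * b1 + b2
    return jax.lax.associative_scan(combine, (a, b))[1][-1]

last_state_jit = jax.jit(last_state)
for L in (2**16, 2**17, 2**18):
    a = jnp.full((L,), 0.999)
    b = jnp.ones((L,))
    last_state_jit(a, b).block_until_ready()   # compile/warm up per shape
    t0 = time.perf_counter()
    last_state_jit(a, b).block_until_ready()
    print(L, time.perf_counter() - t0)         # should roughly double with L
```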
Original abstract
Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5$\times$ higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Mamba, a simplified sequence model architecture built from selective state space models (SSMs). It identifies the lack of content-based reasoning in prior subquadratic models (linear attention, gated convolutions, standard SSMs) as their key limitation on discrete modalities like language. The core technical contribution is making the SSM parameters Δ, B, and C input-dependent, which enables selective propagation or forgetting of information along the sequence. A hardware-aware parallel scan algorithm is derived to enable efficient training despite the loss of convolution structure. The resulting Mamba block stack contains no attention or MLP layers. Empirical claims include linear scaling to million-length sequences, 5× higher inference throughput than Transformers, and state-of-the-art results across language, audio, and genomics; specifically, a 3B-parameter Mamba model outperforms same-size Transformers and matches twice-as-large Transformers on both pretraining perplexity and downstream tasks.
Significance. If the empirical results and the attribution to selectivity are reproducible, the work is significant. It supplies a concrete, scalable mechanism that converts a long-standing weakness of SSMs into a strength while preserving linear complexity and fast inference. The combination of a parameter-efficient selective recurrence with a custom parallel algorithm offers a plausible path toward replacing attention-based backbones on long-context tasks. The paper also ships the implementation details and scaling curves needed for follow-up work.
major comments (2)
- [§5.2] §5.2 (Language Modeling Results) and Table 2: the headline claim that Mamba-3B matches Transformers twice its size is load-bearing for the architectural conclusion. However, the manuscript provides no ablation that holds model size, training tokens, optimizer, and residual structure fixed while disabling input-dependence of Δ, B, C (i.e., reverting to a standard SSM). Without this control, it remains possible that the observed gains arise from the particular projection dimensions, the simplified block design, or hyper-parameter differences rather than selectivity itself. A sketch of the requested control follows these comments.
- [§3.3] §3.3 (Hardware-Aware Algorithm) and Algorithm 1: the parallel scan is presented as numerically stable and hardware-efficient, yet the paper does not report the condition number of the discretized state transition matrix or any ablation on floating-point precision (FP16 vs. BF16) across sequence lengths up to 1M. Because selectivity makes the recurrence input-dependent, small numerical errors could accumulate differently than in time-invariant SSMs; this should be quantified to support the “stable training at scale” claim.
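For the first comment, the requested control amounts to a one-flag change. A minimal sketch, with hypothetical parameter names and dict layout, of the only code difference between the two ablation arms:

```python
import jax
import jax.numpy as jnp

def ssm_step_params(x_t, params, selective):
    """Returns (dt, B_t, C_t) for one step; `selective` is the only
    difference between the ablation arms."""
    if selective:
        dt  = jax.nn.softplus(x_t @ params["W_dt"])  # functions of the input
        B_t = x_t @ params["W_B"]
        C_t = x_t @ params["W_C"]
    else:
        dt  = jax.nn.softplus(params["dt"])          # learned constants (LTI)
        B_t = params["B"]
        C_t = params["C"]
    return dt, B_t, C_t
```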
minor comments (2)
- [Figure 2] Figure 2 (scaling curves) uses log-log axes but does not label the exact sequence lengths or batch sizes used for the throughput measurements; this makes direct comparison with the Transformer baselines harder.
- [§3.1] Notation: the symbol Δ is overloaded between the continuous-time step size and the input-dependent discretization parameter; a brief clarification in §3.1 would avoid confusion for readers familiar with the original S4 formulation.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and have incorporated revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [§5.2] §5.2 (Language Modeling Results) and Table 2: the headline claim that Mamba-3B matches Transformers twice its size is load-bearing for the architectural conclusion. However, the manuscript provides no ablation that holds model size, training tokens, optimizer, and residual structure fixed while disabling input-dependence of Δ, B, C (i.e., reverting to a standard SSM). Without this control, it remains possible that the observed gains arise from the particular projection dimensions, the simplified block design, or hyper-parameter differences rather than selectivity itself.
  Authors: We agree that a tightly controlled ablation isolating input-dependent selectivity (while fixing model size, data, optimizer, and block structure) would provide stronger evidence for the architectural conclusion. Prior comparisons in the manuscript were to external models such as S4 rather than an internal non-selective control. We have added this ablation in the revised Section 5.2 and appendix: a non-selective Mamba-3B variant (time-invariant Δ, B, C) trained under identical conditions shows a clear performance degradation relative to the selective version, supporting that selectivity drives the gains rather than other design choices. (revision: yes)
- Referee: [§3.3] §3.3 (Hardware-Aware Algorithm) and Algorithm 1: the parallel scan is presented as numerically stable and hardware-efficient, yet the paper does not report the condition number of the discretized state transition matrix or any ablation on floating-point precision (FP16 vs. BF16) across sequence lengths up to 1M. Because selectivity makes the recurrence input-dependent, small numerical errors could accumulate differently than in time-invariant SSMs; this should be quantified to support the “stable training at scale” claim.
  Authors: We acknowledge that explicit quantification of numerical properties under input-dependent selectivity strengthens the stability claim. The parallel scan uses standard associative operations, and we observed no instability during training. In the revision we have added (i) condition-number statistics for the discretized state matrices across sequence lengths, showing they remain well-bounded, and (ii) FP16/BF16 precision ablations up to 1M tokens demonstrating equivalent convergence and no differential error accumulation. These results are reported in the updated §3.3 and appendix. (revision: yes)
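A toy version of the kind of precision ablation described here, assuming an associative scan as a proxy for the recurrence: run the identical input-dependent recurrence in float32 and bfloat16 and report the worst-case divergence along the sequence.

```python
import jax
import jax.numpy as jnp

def scan_states(a, b, dtype):
    # Same recurrence, evaluated at a chosen floating-point precision.
    a, b = a.astype(dtype), b.astype(dtype)
    def combine(e1, e2):
        (a1, b1), (a2, b2) = e1, e2
        return a2 * a1, a2 * b1 + b2
    h = jax.lax.associative_scan(combine, (a, b))[1]
    return h.astype(jnp.float32)

L = 4096
key_a, key_b = jax.random.split(jax.random.PRNGKey(0))
# Input-dependent decays kept strictly inside the unit interval for stability.
a = jnp.exp(-jax.nn.softplus(jax.random.normal(key_a, (L,))))
b = jax.random.normal(key_b, (L,))
drift = jnp.max(jnp.abs(scan_states(a, b, jnp.float32)
                        - scan_states(a, b, jnp.bfloat16)))
print(drift)   # worst-case low-precision error accumulated along the scan
```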
Circularity Check
No circularity: claims rest on empirical validation of input-dependent SSMs
full rationale
The paper motivates selective SSMs by identifying a content-based reasoning weakness in prior subquadratic models, proposes making parameters (Δ, B, C) input-dependent as a direct fix, and validates the resulting Mamba architecture through large-scale training and benchmarking on language, audio, and genomics tasks. No derivation step reduces a claimed prediction to a fitted parameter by construction, invokes a self-citation as an unverified uniqueness theorem, or renames an empirical pattern as a first-principles result. The hardware-aware algorithm and end-to-end architecture are presented as engineering choices whose performance is measured externally, leaving the central claims independently falsifiable.
Axiom & Free-Parameter Ledger
free parameters (1)
- model size
axioms (1)
- domain assumption: input-dependent SSM parameters enable content-based reasoning
invented entities (1)
- selective SSM (no independent evidence)
Forward citations
Cited by 60 Pith papers
- Convergent Stochastic Training of Attention and Understanding LoRA
  Attention and LoRA regression losses induce Poincaré inequalities under mild regularization, so SGD-mimicking SDEs converge to minimizers with no assumptions on data or model size.
- Learning the Signature of Memorization in Autoregressive Language Models
  A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.
- The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K-V Asymmetry
  Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicti...
- RULER: What's the Real Context Size of Your Long-Context Language Models?
  RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.
- Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo
  PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.
- SpikeProphecy: A Large-Scale Benchmark for Autoregressive Neural Population Forecasting
  SpikeProphecy decomposes spike-count forecasting performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment, revealing reproducible brain-region predictability rankings and a sub-P...
- Selection, Not Fusion: Radar-Modulated State Space Models for Radar-Camera Depth Estimation
  Radar-Modulated Selection perturbs only the step size Δ and readout C parameters inside Mamba's selective scan with radar data while keeping other components image-only, yielding state-of-the-art depth estimation on n...
- TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles
  TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.
- Variational Linear Attention: Stable Associative Memory for Long-Context Transformers
  VLA stabilizes linear attention by solving regularized least-squares updates with unit-length writes, yielding Jacobian spectral norm exactly 1 and 109x smaller state norms while improving multi-query recall accuracy ...
- Learning to Focus Synthetic Aperture Radar On-line with State-Space Models
  An online SAR focusing framework using state-space models processes raw data line-by-line with 70x lower latency and 130x lower memory than block-based DSP while supporting downstream tasks.
- TIDES: Implicit Time-Awareness in Selective State Space Models
  TIDES reconciles selective SSM expressivity with continuous-time physical discretization by moving input dependence onto the state matrix, enabling native irregular time series handling and achieving SOTA on UEA and P...
- LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
  LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.
- Test-Time Speculation
  Test-Time Speculation adapts draft models online via target-model verifications to sustain high acceptance lengths during long LLM generations.
- Prediction Bottlenecks Don't Discover Causal Structure (But Here's What They Actually Do)
  Prediction bottlenecks do not discover causal structure beyond what linear models, Lasso, and classical Granger/PCMCI methods achieve; intervention benefits are mostly sample-size confounds, leaving a standardized fal...
- VORT: Adaptive Power-Law Memory for NLP Transformers
  VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.
- VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network
  VIMCAN combines Mamba for temporal efficiency and cross-attention for spatial fusion to reach 17.2 mm MPJPE on TotalCapture and 45.3 mm on 3DPW while running above 60 FPS.
- Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control
  Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.
- Long Context Pre-Training with Lighthouse Attention
  Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower los...
- Retrieval from Within: An Intrinsic Capability of Attention-Based Models
  Attention-based models can intrinsically retrieve and reuse pre-encoded evidence chunks via decoder attention queries, unifying retrieval with generation and outperforming external RAG pipelines on QA benchmarks.
- How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences
  In linear recurrent models, infinite-width signal propagation remains accurate only for depths t much smaller than sqrt(width n), with a critical regime at t ~ c sqrt(n) where finite-width effects emerge and dominate ...
- On the Architectural Complexity of Neural Networks
  A framework quantifies DNN complexity via tensor operations, links 40 years of breakthroughs to complexity increases, and releases a dataset of 3000+ unexplored high-complexity architectures.
- Latent State Design for World Models under Sufficiency Constraints
  World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.
- Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts
  Adding temporal memory via LIF, precision-weighted gating, and anticipatory prediction to MoE routers recovers effective expert selection at distribution transitions, with ablation confirming a super-additive beta-ant...
- Auto-FlexSwitch: Efficient Dynamic Model Merging via Learnable Task Vector Compression
  Auto-FlexSwitch achieves efficient dynamic model merging by decomposing task vectors into sparse masks, signs, and scalars, then making the compression learnable via gating and adaptive bit selection with KNN-based retrieval.
- ABC: Any-Subset Autoregression via Non-Markovian Diffusion Bridges in Continuous Time and Space
  ABC enables any-subset autoregressive generation of continuous stochastic processes via non-Markovian diffusion bridges that track physical time and allow path-dependent conditioning.
- Rethink MAE with Linear Time-Invariant Dynamics
  Token order in frozen visual representations is exploitable via SSM-based LTI probes, revealing pre-training-dependent heterogeneity that fixed pooling misses.
- AdaMamba: Adaptive Frequency-Gated Mamba for Long-Term Time Series Forecasting
  AdaMamba adds input-dependent frequency bases and a unified time-frequency forgetting gate to Mamba, yielding higher forecasting accuracy than prior methods on standard long-term time series benchmarks.
- Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding
  LAT-Audio introduces a global-to-local reasoning approach with TWA-CoT that outperforms prior models on temporal tasks for audio up to 30 minutes.
- GraphLeap: Decoupling Graph Construction and Convolution for Vision GNN Acceleration on FPGA
  GraphLeap decouples per-layer graph construction from feature updates in Vision GNNs by using previous-layer features for the current graph, enabling pipelined FPGA acceleration with up to 95.7× CPU speedup after fine-tuning.
- Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences
  Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
- Faster by Design: Interactive Aerodynamics via Neural Surrogates Trained on Expert-Validated CFD
  A graph-based neural operator trained on expert-validated race-car CFD data reaches accuracy levels usable for early-stage interactive aerodynamic design exploration.
- LiquidTAD: Efficient Temporal Action Detection via Parallel Liquid-Inspired Temporal Relaxation
  LiquidTAD distills liquid neural dynamics into a vectorized parallel temporal operator and hierarchical decay sharing to achieve efficient action detection with substantially reduced model size and computation.
- Neural Garbage Collection: Learning to Forget while Learning to Reason
  Language models learn to evict KV cache entries end-to-end via reinforcement learning from outcome reward alone, achieving 2-3x cache compression while maintaining accuracy on Countdown, AMC, and AIME tasks.
- DGSSM: Diffusion guided state-space models for multimodal salient object detection
  DGSSM formulates multimodal salient object detection as a progressive denoising process using diffusion-guided Mamba models, achieving better boundary accuracy and outperforming prior methods on 13 benchmarks.
- Sparse Prefix Caching for Hybrid and Recurrent LLM Serving
  Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.
- Mamba-SSM with LLM Reasoning for Feature Selection: Faithfulness-Aware Biomarker Discovery
  LLM chain-of-thought filtering of Mamba saliency features on TCGA-BRCA data produces a 17-gene set with AUC 0.927 that beats both the raw 50-gene saliency list and a 5000-gene baseline while using far fewer features, ...
- Mamba Sequence Modeling meets Model Predictive Control
  Mamba-MPC stabilizes and tracks references on SISO and MIMO systems in simulation and hardware while outperforming LSTM-MPC with faster computation.
- Minimax Optimality and Spectral Routing for Majority-Vote Ensembles under Markov Dependence
  Majority-vote ensembles on stationary Markov chains have minimax excess risk Omega(sqrt(Tmix/n)); uniform bagging is suboptimal at Omega(Tmix/sqrt(n)), while adaptive spectral routing matches the optimal rate on a gra...
- Better and Worse with Scale: How Contextual Entrainment Diverges with Model Size
  Contextual entrainment decreases for semantic contexts but increases for non-semantic ones as LLMs scale, following power-law trends with 4x better resistance to misinformation but 2x more copying of arbitrary tokens.
- V-Nutri: Dish-Level Nutrition Estimation from Egocentric Cooking Videos
  V-Nutri fuses final-dish features with cooking-process keyframes from egocentric videos to improve dish-level calorie and macronutrient estimation over single-image baselines.
- Beyond Reconstruction: Reconstruction-to-Vector Diffusion for Hyperspectral Anomaly Detection
  R2VD redefines reconstruction as the origin for residual-guided vector diffusion across PPE, GMP, RSM, and VDI stages to achieve superior anomaly detectability and background suppression on eight datasets.
- The Phase Is the Gradient: Equilibrium Propagation for Frequency Learning in Kuramoto Networks
  In Kuramoto networks at equilibrium, weak nudging makes phase displacement the exact gradient of loss w.r.t. natural frequencies, enabling frequency learning that beats weight learning and resolves convergence via spe...
- Hierarchical Kernel Transformer: Multi-Scale Attention with an Information-Theoretic Approximation Analysis
  HKT is a multi-scale attention architecture that bounds computation at 1.31x standard attention, proves kernel and decomposition properties, and reports accuracy gains on ListOps, sequential CIFAR-10, and character-le...
- Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention
  Kathleen uses recurrent oscillator banks, an efficient wavetable encoder, and phase harmonics to classify text at the byte level with high accuracy and low parameter count.
- Controller Design for Structured State-space Models via Contraction Theory
  The paper provides the first controllability and observability analysis for structured state-space models, enabling LMI-based controller synthesis via contraction theory and a separation principle for observers and st...
- The UNDO Flip-Flop: A Controlled Probe for Reversible Semantic State Management in State Space Model
  Mamba-2 models fail to learn reversible state retrieval in the UNDO Flip-Flop task, defaulting to a toggle heuristic and achieving only 41% accuracy under adversarial conditions.
- S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
  S0 tuning optimizes initial recurrent states in hybrid models to outperform LoRA with zero inference cost on HumanEval and partial cross-domain transfer.
- Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
  Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
- Jamba: A Hybrid Transformer-Mamba Language Model
  Jamba presents a hybrid Transformer-Mamba MoE architecture for LLMs that delivers state-of-the-art benchmark performance and strong results up to 256K token contexts while fitting in one 80GB GPU with high throughput.
- Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
  Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.
- Implicit Behavioral Decoding from Next-Step Spike Forecasts at Population Scale
  Mamba forecaster trained on next-step spikes decodes mouse choice at 75.7% and stimulus at 66.1%, beating linear decoding on raw spikes by 4-6 percentage points.
- Elastic Attention Cores for Scalable Vision Transformers
  VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...
- Heteroscedastic Diffusion for Multi-Agent Trajectory Modeling
  U2Diffine augments diffusion denoising with negative log-likelihood loss and first-order uncertainty propagation to jointly perform trajectory completion and provide per-state heteroscedastic uncertainty for multi-age...
- A Single-Layer Model Can Do Language Modeling
  A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).
- Polygon-mamba: Retinal vessel segmentation using polygon scanning mamba and space-frequency collaborative attention
  Polygon-Mamba achieves F1 scores of 0.8283, 0.8282, and 0.8251 on DRIVE, STARE, and CHASE_DB1 by combining polygon scanning Mamba with space-frequency collaborative attention to better detect small retinal vessels.
- DynGhost: Temporally-Modelled Transformer for Dynamic Ghost Imaging with Quantum Detectors
  DynGhost improves dynamic ghost imaging reconstruction by using a transformer with alternating spatial-temporal attention and quantum-aware training on simulated single-photon detector data.
- MambaNetBurst: Direct Byte-level Network Traffic Classification without Tokenization or Pretraining
  A compact Mamba-2 model performs end-to-end byte-level network traffic classification without tokenization or pre-training and remains competitive with substantially larger pre-trained systems.
- Nectar: Neural Estimation of Cached-Token Attention via Regression
  Nectar fits small per-layer per-head neural networks via regression to predict attention outputs and normalizers, enabling constant-time inference independent of context length while preserving semantic generation quality.
- MBP-KT: Learning Global Collaborative Information from Meta-Behavioral Pattern for Enhanced Knowledge Tracing
  MBP-KT uses meta-behavioral pattern sequences and a parameter-free extractor to inject global collaborative information into knowledge tracing models, consistently improving their performance on real datasets.
- Structured Recurrent Mixers for Massively Parallelized Sequence Generation
  Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.