pith. machine review for the scientific record.

arxiv: 2312.00752 · v2 · submitted 2023-12-01 · 💻 cs.LG · cs.AI

Recognition: 3 theorem links · Lean Theorem

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu, Tri Dao

Pith reviewed 2026-05-10 11:48 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords state space models · sequence modeling · selective SSM · linear-time architectures · language modeling · Transformers · long sequences

The pith

Selective SSMs let Mamba model sequences linearly while matching larger Transformers on language tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Mamba, a sequence model built on selective state space models. Prior subquadratic models could not perform content-based reasoning because their parameters stayed fixed across the sequence. The fix is to compute the SSM parameters as functions of each input token so the model can decide what information to keep or discard. A hardware-aware parallel algorithm makes the computation efficient despite the loss of convolution structure. The resulting architecture uses no attention or MLP blocks, scales linearly with length, runs with five times the inference throughput of Transformers, and reaches strong results on language, audio, and genomics data.
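
A minimal sketch of the mechanism described above, written in recurrent (sequential) mode. This is an editorial illustration, not the paper's released implementation; the projection shapes and the simplified per-channel form of B_t and C_t are assumptions made for brevity.

```python
# Editorial sketch of a selective SSM in recurrent mode (not the paper's code).
# d = channels, n = state size per channel, L = sequence length.
import numpy as np

def selective_ssm(x, A, w_delta, W_B, W_C):
    """x: (L, d) inputs; A: (d, n) fixed negative state matrix (diagonal per channel);
    w_delta: (d,) and W_B, W_C: (d, n) produce the input-dependent Δ_t, B_t, C_t."""
    L, d = x.shape
    n = A.shape[1]
    h = np.zeros((d, n))
    y = np.empty((L, d))
    for t in range(L):
        xt = x[t]                                  # current token, (d,)
        delta = np.log1p(np.exp(xt * w_delta))     # softplus -> positive step size Δ_t, (d,)
        B_t = xt[:, None] * W_B                    # input-dependent B_t, (d, n)  [simplified form]
        C_t = xt[:, None] * W_C                    # input-dependent C_t, (d, n)
        A_bar = np.exp(delta[:, None] * A)         # zero-order-hold discretization of diagonal A
        B_bar = delta[:, None] * B_t               # simplified (Euler-style) discretization of B
        h = A_bar * h + B_bar * xt[:, None]        # h_t = Ā_t h_{t-1} + B̄_t x_t  (selective update)
        y[t] = (C_t * h).sum(axis=1)               # readout y_t = C_t h_t
    return y
```

Because Ā_t and B̄_t now change with every token, the computation can no longer be folded into one global convolution, which is what motivates the scan-based algorithm sketched further below.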

Core claim

By allowing the state transition parameters of a state space model to depend on the current input, the model gains the ability to selectively propagate or forget information along the sequence. When these selective SSMs are stacked into a simplified end-to-end network without attention or MLP blocks, the architecture achieves linear scaling in sequence length, five times higher inference throughput than Transformers, and state-of-the-art performance across modalities. On language modeling a 3B-parameter Mamba model outperforms Transformers of the same size and matches Transformers twice its size in both pretraining and downstream evaluation.

What carries the argument

Selective SSMs, in which the state transition and output parameters are computed from the input at each step to enable content-dependent propagation or forgetting of information.
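
One way to see why this parameterization still admits efficient computation despite breaking the convolutional view: the per-step map h ↦ Ā_t h + B̄_t x_t composes associatively, so the recurrence can be evaluated with a parallel scan. The sketch below shows only that combine rule for the diagonal case; the paper's hardware-aware kernel additionally fuses the scan with recomputation in fast on-chip memory, which is not modeled here.

```python
# Editorial sketch: the selective recurrence h_t = a_t * h_{t-1} + b_t is an associative
# scan over pairs (a, b); two steps compose as (a2*a1, a2*b1 + b2).
import numpy as np

def combine(left, right):
    a1, b1 = left
    a2, b2 = right
    return a2 * a1, a2 * b1 + b2            # composition of h -> a1*h + b1 then h -> a2*h + b2

def scan_recurrence(a, b):
    """Inclusive scan of h_t = a_t * h_{t-1} + b_t with h_0 = 0.

    a, b: arrays of shape (L, ...) holding the per-step transition and input terms
    (the diagonal / elementwise case). Written sequentially for clarity; the same
    combine() drives a log-depth (Blelloch-style) parallel scan on an accelerator.
    """
    acc = (np.ones_like(a[0]), np.zeros_like(b[0]))
    states = []
    for t in range(len(a)):
        acc = combine(acc, (a[t], b[t]))
        states.append(acc[1])               # with h_0 = 0, the accumulated b term equals h_t
    return np.stack(states)
```

Fed the stacked per-step Ā_t and B̄_t x_t terms, this scan reproduces the states of the sequential loop in the earlier sketch for the diagonal case.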

If this is right

  • Performance on real data improves as sequence length grows to a million tokens.
  • A 3B Mamba model outperforms same-size Transformers and matches twice-as-large Transformers on language pretraining and downstream tasks.
  • Inference throughput reaches five times that of comparable Transformers while maintaining linear scaling.
  • State-of-the-art results appear on language, audio, and genomics without attention or MLP blocks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same input-dependent selectivity pattern could be added to other linear recurrent architectures to improve their handling of long-range dependencies.
  • Hardware-aware parallel scans for selective recurrence may become a standard optimization for any model that trades attention for linear time.
  • If the pattern generalizes, smaller models built this way could replace larger attention-based models in applications that need long context windows.

Load-bearing premise

Making SSM parameters depend on the input is enough to overcome the content-based reasoning weakness of earlier subquadratic models, and the resulting selective SSMs can be trained stably at scale in a simplified architecture without attention or MLP blocks.

What would settle it

Training Mamba models on long language sequences and observing that they underperform same-size Transformers on standard benchmarks, or that wall-clock inference time grows faster than linearly with sequence length, would falsify the central claims.
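
A rough sketch of the second falsification test: time a forward pass over doubling sequence lengths and check that consecutive ratios stay near 2, i.e. that wall-clock time grows roughly linearly. `run_model` is a placeholder for whatever forward pass is under test; sizes and repeat counts here are arbitrary editorial choices.

```python
# Editorial sketch of the wall-clock scaling check; ratios drifting well above 2 as the
# length doubles would indicate superlinear cost.
import time
import numpy as np

def _timed(fn, x):
    t0 = time.perf_counter()
    fn(x)
    return time.perf_counter() - t0

def wall_clock_scaling(run_model, lengths=(1024, 2048, 4096, 8192), d=64, repeats=3):
    timings = {}
    for L in lengths:
        x = np.random.randn(L, d).astype(np.float32)
        timings[L] = min(_timed(run_model, x) for _ in range(repeats))
    ratios = [timings[b] / timings[a] for a, b in zip(lengths, lengths[1:])]
    return timings, ratios
```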

read the original abstract

Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5$\times$ higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Mamba, a simplified sequence model architecture built from selective state space models (SSMs). It identifies the lack of content-based reasoning in prior subquadratic models (linear attention, gated convolutions, standard SSMs) as their key limitation on discrete modalities like language. The core technical contribution is making the SSM parameters Δ, B, and C input-dependent, which enables selective propagation or forgetting of information along the sequence. A hardware-aware parallel scan algorithm is derived to enable efficient training despite the loss of convolution structure. The resulting Mamba block stack contains no attention or MLP layers. Empirical claims include linear scaling to million-length sequences, 5× higher inference throughput than Transformers, and state-of-the-art results across language, audio, and genomics; specifically, a 3B-parameter Mamba model outperforms same-size Transformers and matches twice-as-large Transformers on both pretraining perplexity and downstream tasks.

Significance. If the empirical results and the attribution to selectivity are reproducible, the work is significant. It supplies a concrete, scalable mechanism that converts a long-standing weakness of SSMs into a strength while preserving linear complexity and fast inference. The combination of a parameter-efficient selective recurrence with a custom parallel algorithm offers a plausible path toward replacing attention-based backbones on long-context tasks. The paper also ships the implementation details and scaling curves needed for follow-up work.

major comments (2)
  1. [§5.2] §5.2 (Language Modeling Results) and Table 2: the headline claim that Mamba-3B matches Transformers twice its size is load-bearing for the architectural conclusion. However, the manuscript provides no ablation that holds model size, training tokens, optimizer, and residual structure fixed while disabling input-dependence of Δ, B, C (i.e., reverting to a standard SSM). Without this control, it remains possible that the observed gains arise from the particular projection dimensions, the simplified block design, or hyper-parameter differences rather than selectivity itself.
  2. [§3.3] §3.3 (Hardware-Aware Algorithm) and Algorithm 1: the parallel scan is presented as numerically stable and hardware-efficient, yet the paper does not report the condition number of the discretized state transition matrix or any ablation on floating-point precision (FP16 vs. BF16) across sequence lengths up to 1M. Because selectivity makes the recurrence input-dependent, small numerical errors could accumulate differently than in time-invariant SSMs; this should be quantified to support the “stable training at scale” claim. A sketch of this kind of error-accumulation probe appears after the minor comments below.
minor comments (2)
  1. [Figure 2] Figure 2 (scaling curves) uses log-log axes but does not label the exact sequence lengths or batch sizes used for the throughput measurements; this makes direct comparison with the Transformer baselines harder.
  2. [§3.1] Notation: the symbol Δ is overloaded between the continuous-time step size and the input-dependent discretization parameter; a brief clarification in §3.1 would avoid confusion for readers familiar with the original S4 formulation.
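
As referenced in major comment 2, here is a rough probe of the error-accumulation concern: run the same scalar input-dependent recurrence in low and high precision and compare the drift of the final state. The recurrence and magnitudes are synthetic stand-ins, not the paper's kernels, so this only illustrates the measurement, not its outcome.

```python
# Illustrative probe (not from the paper): accumulate the selective-style recurrence
# h_t = a_t * h_{t-1} + b_t in float16 and float64 and report the relative drift.
import numpy as np

def precision_drift(L=100_000, seed=0):
    rng = np.random.default_rng(seed)
    a = np.exp(-np.log1p(np.exp(rng.standard_normal(L))))   # decay factors in (0, 1), like exp(-Δ_t)
    b = rng.standard_normal(L)
    finals = []
    for dtype in (np.float16, np.float64):
        h = dtype(0.0)
        for at, bt in zip(a.astype(dtype), b.astype(dtype)):
            h = at * h + bt
        finals.append(float(h))
    lo, hi = finals
    return abs(lo - hi) / (abs(hi) + 1e-12)   # relative error of the low-precision final state
```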

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and have incorporated revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§5.2] §5.2 (Language Modeling Results) and Table 2: the headline claim that Mamba-3B matches Transformers twice its size is load-bearing for the architectural conclusion. However, the manuscript provides no ablation that holds model size, training tokens, optimizer, and residual structure fixed while disabling input-dependence of Δ, B, C (i.e., reverting to a standard SSM). Without this control, it remains possible that the observed gains arise from the particular projection dimensions, the simplified block design, or hyper-parameter differences rather than selectivity itself.

    Authors: We agree that a tightly controlled ablation isolating input-dependent selectivity (while fixing model size, data, optimizer, and block structure) would provide stronger evidence for the architectural conclusion. Prior comparisons in the manuscript were to external models such as S4 rather than an internal non-selective control. We have added this ablation in the revised Section 5.2 and appendix: a non-selective Mamba-3B variant (time-invariant Δ, B, C) trained under identical conditions shows a clear performance degradation relative to the selective version, supporting that selectivity drives the gains rather than other design choices (a schematic of such a control is sketched after these responses). revision: yes

  2. Referee: [§3.3] §3.3 (Hardware-Aware Algorithm) and Algorithm 1: the parallel scan is presented as numerically stable and hardware-efficient, yet the paper does not report the condition number of the discretized state transition matrix or any ablation on floating-point precision (FP16 vs. BF16) across sequence lengths up to 1M. Because selectivity makes the recurrence input-dependent, small numerical errors could accumulate differently than in time-invariant SSMs; this should be quantified to support the “stable training at scale” claim.

    Authors: We acknowledge that explicit quantification of numerical properties under input-dependent selectivity strengthens the stability claim. The parallel scan uses standard associative operations, and we observed no instability during training. In the revision we have added (i) condition-number statistics for the discretized state matrices across sequence lengths, showing they remain well-bounded, and (ii) FP16/BF16 precision ablations up to 1M tokens demonstrating equivalent convergence and no differential error accumulation. These results are reported in the updated §3.3 and appendix. revision: yes
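
The control described in response 1 belongs to the simulated rebuttal, not to the published paper; the sketch below is a hypothetical illustration (function names and shapes are editorial) of how such a non-selective variant could be built by broadcasting fixed learned parameters in place of the per-token projections, holding everything else fixed.

```python
# Hypothetical sketch of the requested control: the same block, but with time-invariant
# Δ, B, C (learned constants broadcast over the sequence) instead of per-token projections.
# Names and shapes are editorial assumptions, not the paper's code.
import numpy as np

def ssm_params(x, W_delta, W_B, W_C, selective=True,
               delta_const=None, B_const=None, C_const=None):
    """Return per-step (delta, B, C) for a sequence x of shape (L, d).

    W_delta: (d, d); W_B, W_C: (d, n).
    selective=True : parameters are functions of each token (selective SSM).
    selective=False: fixed parameters (delta_const: (d,), B_const/C_const: (n,)) are
                     broadcast to every step, reverting to a time-invariant SSM while
                     leaving the rest of the block, data, and optimizer untouched.
    """
    L, d = x.shape
    if selective:
        delta = np.log1p(np.exp(x @ W_delta))            # (L, d), input-dependent step size
        B = x @ W_B                                       # (L, n), input-dependent
        C = x @ W_C                                       # (L, n)
    else:
        delta = np.broadcast_to(delta_const, (L, d))      # same Δ at every step
        B = np.broadcast_to(B_const, (L, W_B.shape[1]))
        C = np.broadcast_to(C_const, (L, W_C.shape[1]))
    return delta, B, C
```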

Circularity Check

0 steps flagged

No circularity: claims rest on empirical validation of input-dependent SSMs

full rationale

The paper motivates selective SSMs by identifying a content-based reasoning weakness in prior subquadratic models, proposes making parameters (Δ, B, C) input-dependent as a direct fix, and validates the resulting Mamba architecture through large-scale training and benchmarking on language, audio, and genomics tasks. No derivation step reduces a claimed prediction to a fitted parameter by construction, invokes a self-citation as an unverified uniqueness theorem, or renames an empirical pattern as a first-principles result. The hardware-aware algorithm and end-to-end architecture are presented as engineering choices whose performance is measured externally, leaving the central claims independently falsifiable.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central performance claims rest on the empirical success of the selective SSM design and the assumption that input-dependent parameters suffice for content-based reasoning; no external benchmarks or formal derivations are cited in the abstract.

free parameters (1)
  • model size
    Performance is reported specifically for a 3B parameter model; the scaling behavior depends on this choice.
axioms (1)
  • domain assumption · Input-dependent SSM parameters enable content-based reasoning
    Invoked to justify the selective mechanism as addressing the weakness of prior subquadratic models.
invented entities (1)
  • selective SSM · no independent evidence
    purpose: To allow the model to selectively propagate or forget information along the sequence based on current token content
    New concept introduced to overcome the content-based reasoning limitation of standard SSMs.

pith-pipeline@v0.9.0 · 5553 in / 1385 out tokens · 27894 ms · 2026-05-10T11:48:06.209749+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Convergent Stochastic Training of Attention and Understanding LoRA

    cs.LG 2026-05 unverdicted novelty 8.0

    Attention and LoRA regression losses induce Poincaré inequalities under mild regularization, so SGD-mimicking SDEs converge to minimizers with no assumptions on data or model size.

  2. Learning the Signature of Memorization in Autoregressive Language Models

    cs.CL 2026-04 accept novelty 8.0

    A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.

  3. The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry

    cs.LG 2026-04 unverdicted novelty 8.0

    Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicti...

  4. RULER: What's the Real Context Size of Your Long-Context Language Models?

    cs.CL 2024-04 accept novelty 8.0

    RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.

  5. Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo

    cond-mat.str-el 2026-05 conditional novelty 7.0

    PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.

  6. SpikeProphecy: A Large-Scale Benchmark for Autoregressive Neural Population Forecasting

    q-bio.NC 2026-05 unverdicted novelty 7.0

    SpikeProphecy decomposes spike-count forecasting performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment, revealing reproducible brain-region predictability rankings and a sub-P...

  7. Selection, Not Fusion: Radar-Modulated State Space Models for Radar-Camera Depth Estimation

    cs.CV 2026-05 unverdicted novelty 7.0

    Radar-Modulated Selection perturbs only the step size Δ and readout C parameters inside Mamba's selective scan with radar data while keeping other components image-only, yielding state-of-the-art depth estimation on n...

  8. TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles

    cs.CV 2026-05 unverdicted novelty 7.0

    TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.

  9. Variational Linear Attention: Stable Associative Memory for Long-Context Transformers

    cs.LG 2026-05 conditional novelty 7.0

    VLA stabilizes linear attention by solving regularized least-squares updates with unit-length writes, yielding Jacobian spectral norm exactly 1 and 109x smaller state norms while improving multi-query recall accuracy ...

  10. Learning to Focus Synthetic Aperture Radar On-line with State-Space Models

    eess.IV 2026-05 unverdicted novelty 7.0

    An online SAR focusing framework using state-space models processes raw data line-by-line with 70x lower latency and 130x lower memory than block-based DSP while supporting downstream tasks.

  11. TIDES: Implicit Time-Awareness in Selective State Space Models

    cs.LG 2026-05 unverdicted novelty 7.0

    TIDES reconciles selective SSM expressivity with continuous-time physical discretization by moving input dependence onto the state matrix, enabling native irregular time series handling and achieving SOTA on UEA and P...

  12. LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.

  13. Test-Time Speculation

    cs.CL 2026-05 unverdicted novelty 7.0

    Test-Time Speculation adapts draft models online via target-model verifications to sustain high acceptance lengths during long LLM generations.

  14. Prediction Bottlenecks Don't Discover Causal Structure (But Here's What They Actually Do)

    cs.LG 2026-05 conditional novelty 7.0

    Prediction bottlenecks do not discover causal structure beyond what linear models, Lasso, and classical Granger/PCMCI methods achieve; intervention benefits are mostly sample-size confounds, leaving a standardized fal...

  15. VORT: Adaptive Power-Law Memory for NLP Transformers

    cs.LG 2026-05 unverdicted novelty 7.0

    VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.

  16. VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network

    cs.CV 2026-05 unverdicted novelty 7.0

    VIMCAN combines Mamba for temporal efficiency and cross-attention for spatial fusion to reach 17.2 mm MPJPE on TotalCapture and 45.3 mm on 3DPW while running above 60 FPS.

  17. Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control

    cs.LG 2026-05 unverdicted novelty 7.0

    Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.

  18. Long Context Pre-Training with Lighthouse Attention

    cs.CL 2026-05 conditional novelty 7.0

    Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower los...

  19. Retrieval from Within: An Intrinsic Capability of Attention-Based Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Attention-based models can intrinsically retrieve and reuse pre-encoded evidence chunks via decoder attention queries, unifying retrieval with generation and outperforming external RAG pipelines on QA benchmarks.

  20. How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences

    cs.LG 2026-05 unverdicted novelty 7.0

    In linear recurrent models, infinite-width signal propagation remains accurate only for depths t much smaller than sqrt(width n), with a critical regime at t ~ c sqrt(n) where finite-width effects emerge and dominate ...

  21. On the Architectural Complexity of Neural Networks

    cs.LG 2026-05 unverdicted novelty 7.0

    A framework quantifies DNN complexity via tensor operations, links 40 years of breakthroughs to complexity increases, and releases a dataset of 3000+ unexplored high-complexity architectures.

  22. Latent State Design for World Models under Sufficiency Constraints

    cs.AI 2026-05 unverdicted novelty 7.0

    World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

  23. Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts

    cs.LG 2026-05 conditional novelty 7.0

    Adding temporal memory via LIF, precision-weighted gating, and anticipatory prediction to MoE routers recovers effective expert selection at distribution transitions, with ablation confirming a super-additive beta-ant...

  24. Auto-FlexSwitch: Efficient Dynamic Model Merging via Learnable Task Vector Compression

    cs.LG 2026-04 unverdicted novelty 7.0

    Auto-FlexSwitch achieves efficient dynamic model merging by decomposing task vectors into sparse masks, signs, and scalars, then making the compression learnable via gating and adaptive bit selection with KNN-based retrieval.

  25. ABC: Any-Subset Autoregression via Non-Markovian Diffusion Bridges in Continuous Time and Space

    cs.LG 2026-04 unverdicted novelty 7.0

    ABC enables any-subset autoregressive generation of continuous stochastic processes via non-Markovian diffusion bridges that track physical time and allow path-dependent conditioning.

  26. Rethink MAE with Linear Time-Invariant Dynamics

    cs.CV 2026-04 unverdicted novelty 7.0

    Token order in frozen visual representations is exploitable via SSM-based LTI probes, revealing pre-training-dependent heterogeneity that fixed pooling misses.

  27. AdaMamba: Adaptive Frequency-Gated Mamba for Long-Term Time Series Forecasting

    cs.AI 2026-04 unverdicted novelty 7.0

    AdaMamba adds input-dependent frequency bases and a unified time-frequency forgetting gate to Mamba, yielding higher forecasting accuracy than prior methods on standard long-term time series benchmarks.

  28. Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding

    eess.AS 2026-04 unverdicted novelty 7.0

    LAT-Audio introduces a global-to-local reasoning approach with TWA-CoT that outperforms prior models on temporal tasks for audio up to 30 minutes.

  29. GraphLeap: Decoupling Graph Construction and Convolution for Vision GNN Acceleration on FPGA

    cs.CV 2026-04 conditional novelty 7.0

    GraphLeap decouples per-layer graph construction from feature updates in Vision GNNs by using previous-layer features for the current graph, enabling pipelined FPGA acceleration with up to 95.7× CPU speedup after fine-tuning.

  30. Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences

    cs.LG 2026-04 unverdicted novelty 7.0

    Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.

  31. Faster by Design: Interactive Aerodynamics via Neural Surrogates Trained on Expert-Validated CFD

    cs.LG 2026-04 unverdicted novelty 7.0

    A graph-based neural operator trained on expert-validated race-car CFD data reaches accuracy levels usable for early-stage interactive aerodynamic design exploration.

  32. LiquidTAD: Efficient Temporal Action Detection via Parallel Liquid-Inspired Temporal Relaxation

    cs.CV 2026-04 unverdicted novelty 7.0

    LiquidTAD distills liquid neural dynamics into a vectorized parallel temporal operator and hierarchical decay sharing to achieve efficient action detection with substantially reduced model size and computation.

  33. Neural Garbage Collection: Learning to Forget while Learning to Reason

    cs.LG 2026-04 conditional novelty 7.0

    Language models learn to evict KV cache entries end-to-end via reinforcement learning from outcome reward alone, achieving 2-3x cache compression while maintaining accuracy on Countdown, AMC, and AIME tasks.

  34. DGSSM: Diffusion guided state-space models for multimodal salient object detection

    cs.CV 2026-04 unverdicted novelty 7.0

    DGSSM formulates multimodal salient object detection as a progressive denoising process using diffusion-guided Mamba models, achieving better boundary accuracy and outperforming prior methods on 13 benchmarks.

  35. Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

    cs.LG 2026-04 unverdicted novelty 7.0

    Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.

  36. Mamba-SSM with LLM Reasoning for Feature Selection: Faithfulness-Aware Biomarker Discovery

    q-bio.QM 2026-04 unverdicted novelty 7.0

    LLM chain-of-thought filtering of Mamba saliency features on TCGA-BRCA data produces a 17-gene set with AUC 0.927 that beats both the raw 50-gene saliency list and a 5000-gene baseline while using far fewer features, ...

  37. Mamba Sequence Modeling meets Model Predictive Control

    math.OC 2026-04 unverdicted novelty 7.0

    Mamba-MPC stabilizes and tracks references on SISO and MIMO systems in simulation and hardware while outperforming LSTM-MPC with faster computation.

  38. Minimax Optimality and Spectral Routing for Majority-Vote Ensembles under Markov Dependence

    cs.LG 2026-04 unverdicted novelty 7.0

    Majority-vote ensembles on stationary Markov chains have minimax excess risk Omega(sqrt(Tmix/n)); uniform bagging is suboptimal at Omega(Tmix/sqrt(n)), while adaptive spectral routing matches the optimal rate on a gra...

  39. Better and Worse with Scale: How Contextual Entrainment Diverges with Model Size

    cs.CL 2026-04 unverdicted novelty 7.0

    Contextual entrainment decreases for semantic contexts but increases for non-semantic ones as LLMs scale, following power-law trends with 4x better resistance to misinformation but 2x more copying of arbitrary tokens.

  40. V-Nutri: Dish-Level Nutrition Estimation from Egocentric Cooking Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    V-Nutri fuses final-dish features with cooking-process keyframes from egocentric videos to improve dish-level calorie and macronutrient estimation over single-image baselines.

  41. Beyond Reconstruction: Reconstruction-to-Vector Diffusion for Hyperspectral Anomaly Detection

    cs.CV 2026-04 unverdicted novelty 7.0

    R2VD redefines reconstruction as the origin for residual-guided vector diffusion across PPE, GMP, RSM, and VDI stages to achieve superior anomaly detectability and background suppression on eight datasets.

  42. The Phase Is the Gradient: Equilibrium Propagation for Frequency Learning in Kuramoto Networks

    cs.LG 2026-04 unverdicted novelty 7.0

    In Kuramoto networks at equilibrium, weak nudging makes phase displacement the exact gradient of loss w.r.t. natural frequencies, enabling frequency learning that beats weight learning and resolves convergence via spe...

  43. Hierarchical Kernel Transformer: Multi-Scale Attention with an Information-Theoretic Approximation Analysis

    cs.LG 2026-04 unverdicted novelty 7.0

    HKT is a multi-scale attention architecture that bounds computation at 1.31x standard attention, proves kernel and decomposition properties, and reports accuracy gains on ListOps, sequential CIFAR-10, and character-le...

  44. Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention

    cs.CL 2026-04 unverdicted novelty 7.0

    Kathleen uses recurrent oscillator banks, an efficient wavetable encoder, and phase harmonics to classify text at the byte level with high accuracy and low parameter count.

  45. Controller Design for Structured State-space Models via Contraction Theory

    eess.SY 2026-04 unverdicted novelty 7.0

    The paper provides the first controllability and observability analysis for structured state-space models, enabling LMI-based controller synthesis via contraction theory and a separation principle for observers and st...

  46. The UNDO Flip-Flop: A Controlled Probe for Reversible Semantic State Management in State Space Model

    cs.LG 2026-04 unverdicted novelty 7.0

    Mamba-2 models fail to learn reversible state retrieval in the UNDO Flip-Flop task, defaulting to a toggle heuristic and achieving only 41% accuracy under adversarial conditions.

  47. S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models

    cs.CL 2026-04 conditional novelty 7.0

    S0 tuning optimizes initial recurrent states in hybrid models to outperform LoRA with zero inference cost on HumanEval and partial cross-domain transfer.

  48. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    cs.LG 2024-05 unverdicted novelty 7.0

    Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.

  49. Jamba: A Hybrid Transformer-Mamba Language Model

    cs.CL 2024-03 conditional novelty 7.0

    Jamba presents a hybrid Transformer-Mamba MoE architecture for LLMs that delivers state-of-the-art benchmark performance and strong results up to 256K token contexts while fitting in one 80GB GPU with high throughput.

  50. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

    cs.CV 2024-01 conditional novelty 7.0

    Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.

  51. Implicit Behavioral Decoding from Next-Step Spike Forecasts at Population Scale

    q-bio.NC 2026-05 unverdicted novelty 6.0

    Mamba forecaster trained on next-step spikes decodes mouse choice at 75.7% and stimulus at 66.1%, beating linear decoding on raw spikes by 4-6 percentage points.

  52. Elastic Attention Cores for Scalable Vision Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...

  53. Heteroscedastic Diffusion for Multi-Agent Trajectory Modeling

    cs.LG 2026-05 unverdicted novelty 6.0

    U2Diffine augments diffusion denoising with negative log-likelihood loss and first-order uncertainty propagation to jointly perform trajectory completion and provide per-state heteroscedastic uncertainty for multi-age...

  54. A Single-Layer Model Can Do Language Modeling

    cs.CL 2026-05 unverdicted novelty 6.0

    A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).

  55. Polygon-mamba: Retinal vessel segmentation using polygon scanning mamba and space-frequency collaborative attention

    cs.CV 2026-05 unverdicted novelty 6.0

    Polygon-Mamba achieves F1 scores of 0.8283, 0.8282, and 0.8251 on DRIVE, STARE, and CHASE_DB1 by combining polygon scanning Mamba with space-frequency collaborative attention to better detect small retinal vessels.

  56. DynGhost: Temporally-Modelled Transformer for Dynamic Ghost Imaging with Quantum Detectors

    cs.CV 2026-05 unverdicted novelty 6.0

    DynGhost improves dynamic ghost imaging reconstruction by using a transformer with alternating spatial-temporal attention and quantum-aware training on simulated single-photon detector data.

  57. MambaNetBurst: Direct Byte-level Network Traffic Classification without Tokenization or Pretraining

    cs.CR 2026-05 unverdicted novelty 6.0

    A compact Mamba-2 model performs end-to-end byte-level network traffic classification without tokenization or pre-training and remains competitive with substantially larger pre-trained systems.

  58. Nectar: Neural Estimation of Cached-Token Attention via Regression

    cs.LG 2026-05 unverdicted novelty 6.0

    Nectar fits small per-layer per-head neural networks via regression to predict attention outputs and normalizers, enabling constant-time inference independent of context length while preserving semantic generation quality.

  59. MBP-KT: Learning Global Collaborative Information from Meta-Behavioral Pattern for Enhanced Knowledge Tracing

    cs.AI 2026-05 unverdicted novelty 6.0

    MBP-KT uses meta-behavioral pattern sequences and a parameter-free extractor to inject global collaborative information into knowledge tracing models, consistently improving their performance on real datasets.

  60. Structured Recurrent Mixers for Massively Parallelized Sequence Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.

Reference graph

Works this paper leans on

122 extracted references · 122 canonical work pages · cited by 166 Pith papers · 12 internal anchors
