pith. sign in

super hub Mixed citations

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Mixed citation behavior. Most common role is background (68%).

125 Pith papers citing it
Background 68% of classified citations
abstract

While Transformers have been the main architecture behind deep learning's success in language modeling, state-space models (SSMs) such as Mamba have recently been shown to match or outperform Transformers at small to medium scale. We show that these families of models are actually quite closely related, and develop a rich framework of theoretical connections between SSMs and variants of attention, connected through various decompositions of a well-studied class of structured semiseparable matrices. Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is an a refinement of Mamba's selective SSM that is 2-8X faster, while continuing to be competitive with Transformers on language modeling.

hub tools

citation-role summary

background 22 method 4 baseline 1 extension 1

citation-polarity summary

claims ledger

  • abstract While Transformers have been the main architecture behind deep learning's success in language modeling, state-space models (SSMs) such as Mamba have recently been shown to match or outperform Transformers at small to medium scale. We show that these families of models are actually quite closely related, and develop a rich framework of theoretical connections between SSMs and variants of attention, connected through various decompositions of a well-studied class of structured semiseparable matrices. Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose c

authors

co-cited works

clear filters

representative citing papers

WriteSAE: Sparse Autoencoders for Recurrent State

cs.LG · 2026-05-12 · unverdicted · novelty 8.0 · 2 refs

WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.

Morphing into Hybrid Attention Models

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

FlashMorph formulates hybrid layer selection as budget-constrained optimization, trains per-layer gates on synthetic retrieval data with linearization regularization, then discretizes and distills to produce efficient hybrid architectures.

Tapered Language Models

cs.LG · 2026-06-22 · unverdicted · novelty 7.0

Tapered Language Models monotonically decrease MLP width across depth with a cosine schedule, yielding better perplexity and downstream performance than uniform-width baselines across multiple architectures and scales at no extra cost.

A Verifiable Search Is Not a Learnable Chain-of-Thought

cs.LG · 2026-06-20 · unverdicted · novelty 7.0

Verifiable search procedures cannot be learned as forward chain-of-thought by language models; they instead learn memorization, verification, or require precomputed catalogs.

Forget Attention: Importance-Aware Attention Is All You Need

cs.AI · 2026-06-01 · unverdicted · novelty 7.0

SISA adds an SSM importance term inside the attention score and runs the full operation as one SDPA call on augmented Q/K vectors, reporting better LAMBADA and perfect NIAH at small scale.

DSSP: Diffusion State Space Policy with Full-History Encoding

cs.RO · 2026-05-14 · conditional · novelty 7.0

DSSP is a history-conditioned diffusion state space policy that uses SSMs to encode full observation streams with an auxiliary dynamics objective and hierarchical fusion, achieving SOTA results with reduced model size in robot manipulation.

TIDES: Implicit Time-Awareness in Selective State Space Models

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

TIDES reconciles selective SSM expressivity with continuous-time physical discretization by moving input dependence onto the state matrix, enabling native irregular time series handling and achieving SOTA on UEA and Physiome-ODE benchmarks.

Rethink MAE with Linear Time-Invariant Dynamics

cs.CV · 2026-04-29 · unverdicted · novelty 7.0

Token order in frozen visual representations is exploitable via SSM-based LTI probes, revealing pre-training-dependent heterogeneity that fixed pooling misses.

Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

cs.LG · 2026-04-17 · unverdicted · novelty 7.0

Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.

citing papers explorer

Showing 22 of 22 citing papers after filters.

  • TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles cs.CV · 2026-05-12 · unverdicted · none · ref 8 · internal anchor

    TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.

  • TIDES: Implicit Time-Awareness in Selective State Space Models cs.LG · 2026-05-10 · unverdicted · none · ref 7 · internal anchor

    TIDES reconciles selective SSM expressivity with continuous-time physical discretization by moving input dependence onto the state matrix, enabling native irregular time series handling and achieving SOTA on UEA and Physiome-ODE benchmarks.

  • FRACTAL: SSM with Fractional Recurrent Architecture for Computational Temporal Analysis of Long Sequences cs.AI · 2026-05-09 · unverdicted · none · ref 47 · internal anchor

    FRACTAL integrates fractional recurrent architecture into SSMs using a tunable singularity index to capture multi-scale temporal features, reporting 87.11% average on Long Range Arena and outperforming S5.

  • Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control cs.LG · 2026-05-08 · unverdicted · none · ref 11 · internal anchor

    Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.

  • PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization cs.LG · 2026-05-07 · unverdicted · none · ref 12 · 2 links · internal anchor

    PairAlign learns compact variable-length token sequences for audio via self-alignment on paired content-preserving views, achieving 55% fewer archive tokens than VQ while preserving edit-distance retrieval at 12.71 tokens/s.

  • MambaNetBurst: Direct Byte-level Network Traffic Classification without Tokenization or Pretraining cs.CR · 2026-05-11 · unverdicted · none · ref 16 · internal anchor

    A compact Mamba-2 model performs end-to-end byte-level network traffic classification without tokenization or pre-training and remains competitive with substantially larger pre-trained systems.

  • Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators cs.LG · 2026-05-07 · unverdicted · none · ref 7 · internal anchor

    Spectral Koopman operators let SSMs achieve 100% accuracy on long-gap multi-query associative recall with fixed memory, where pure Mamba fails.

  • The Impossibility Triangle of Long-Context Modeling cs.CL · 2026-05-06 · unverdicted · none · ref 8 · internal anchor

    No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.

  • HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models cs.LG · 2026-04-24 · unverdicted · none · ref 8 · internal anchor

    HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.

  • Sonata: A Hybrid World Model for Inertial Kinematics under Clinical Data Scarcity cs.LG · 2026-04-20 · unverdicted · none · ref 44 · internal anchor

    Sonata is a small hybrid world model pre-trained to predict future IMU states that outperforms autoregressive baselines on clinical discrimination, fall-risk prediction, and cross-cohort transfer while fitting on-device wearables.

  • Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction cs.CV · 2026-04-09 · unverdicted · none · ref 15 · internal anchor

    Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.

  • In-Place Test-Time Training cs.LG · 2026-04-07 · conditional · none · ref 16 · internal anchor

    In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

  • Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space cs.CL · 2026-04-06 · unverdicted · none · ref 102 · internal anchor

    PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.

  • Computer Architecture's AlphaZero Moment: Automated Discovery in an Encircled World cs.AR · 2026-03-31 · conditional · none · ref 38 · internal anchor

    Automated architectural discovery engines can outperform human design teams by exploring massive design spaces and compressing development cycles from months to weeks.

  • Kimi Linear: An Expressive, Efficient Attention Architecture cs.CL · 2025-10-30 · unverdicted · none · ref 16 · internal anchor

    Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

  • MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention cs.CL · 2025-06-16 · unverdicted · none · ref 7 · internal anchor

    MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training in three weeks on 512 GPUs.

  • Titans: Learning to Memorize at Test Time cs.LG · 2024-12-31 · unverdicted · none · ref 27 · internal anchor

    Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.

  • Mela: Test-Time Memory Consolidation based on Transformation Hypothesis cs.CL · 2026-05-11 · unverdicted · none · ref 7 · internal anchor

    Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.

  • Kaczmarz Linear Attention cs.LG · 2026-05-09 · unverdicted · none · ref 8 · internal anchor

    Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack performance.

  • Cubit: Token Mixer with Kernel Ridge Regression cs.LG · 2026-05-07 · unverdicted · none · ref 15 · 2 links · internal anchor

    Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.

  • Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving cs.DC · 2026-05-07 · unverdicted · none · ref 7 · internal anchor

    Irminsul recovers up to 83% of prompt tokens above exact-prefix matching and delivers 63% prefill energy savings per cache hit on MLA-MoE models by content-hashing CDC chunks and applying closed-form kr correction.

  • Sessa: Selective State Space Attention cs.LG · 2026-04-20 · unverdicted · none · ref 14 · internal anchor

    Sessa integrates attention within recurrent paths to achieve power-law memory tails and flexible non-decaying selective retrieval, outperforming baselines on long-context tasks.