{"work":{"id":"d8eba076-0449-4f6a-aae1-5a7260677f0f","openalex_id":null,"doi":null,"arxiv_id":"2405.21060","raw_key":null,"title":"Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality","authors":null,"authors_text":"Tri Dao, Albert Gu","year":2024,"venue":"cs.LG","abstract":"While Transformers have been the main architecture behind deep learning's success in language modeling, state-space models (SSMs) such as Mamba have recently been shown to match or outperform Transformers at small to medium scale. We show that these families of models are actually quite closely related, and develop a rich framework of theoretical connections between SSMs and variants of attention, connected through various decompositions of a well-studied class of structured semiseparable matrices. Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is an a refinement of Mamba's selective SSM that is 2-8X faster, while continuing to be competitive with Transformers on language modeling.","external_url":"https://arxiv.org/abs/2405.21060","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-14T22:08:15.259490+00:00","pith_arxiv_id":"2405.21060","created_at":"2026-05-09T22:49:16.011462+00:00","updated_at":"2026-05-14T22:08:15.259490+00:00","title_quality_ok":true,"display_title":"Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality","render_title":"Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality"},"hub":{"state":{"work_id":"d8eba076-0449-4f6a-aae1-5a7260677f0f","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":50,"external_cited_by_count":null,"distinct_field_count":9,"first_pith_cited_at":"2024-12-31T22:32:03+00:00","last_pith_cited_at":"2026-05-12T09:25:54+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-15T00:36:15.064221+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"extension","n":1}],"polarity_counts":[{"context_polarity":"unclear","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T15:01:44.819699+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Mamba: Linear-Time Sequence Modeling with Selective State Spaces","work_id":"4ee75248-1199-492c-a52f-6661e0f4adff","shared_citers":25},{"title":"Efficiently Modeling Long Sequences with Structured State Spaces","work_id":"4150b761-b8bf-4d9b-a2f8-cb2d1b73d378","shared_citers":14},{"title":"Retentive Network: A Successor to Transformer for Large Language Models","work_id":"5b0449ac-92b0-41f2-8b4f-586c2b5a08b6","shared_citers":11},{"title":"Gated Delta Networks: Improving Mamba2 with Delta Rule","work_id":"884939d3-e283-4625-bff4-b7e0e4cc2a6e","shared_citers":10},{"title":"Jamba: A Hybrid Transformer-Mamba Language Model","work_id":"129df0fe-8a66-4077-8991-3557cfa38274","shared_citers":10},{"title":"Think you have Solved Question Answering? 
Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":9},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":8},{"title":"Longformer: The Long-Document Transformer","work_id":"abea7a44-6668-4de7-aab6-f53a6e5aa088","shared_citers":8},{"title":"RWKV: Reinventing RNNs for the Transformer Era","work_id":"524dc80d-f4ef-4f89-bf1a-9a8c1e4b6a81","shared_citers":8},{"title":"Gated linear attention transformers with hardware-efficient training","work_id":"65a18a30-6e80-4b64-a026-bb0368e38872","shared_citers":7},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":7},{"title":"Attention Is All You Need","work_id":"baafb5a2-5272-43bc-932f-09fa9ffe5316","shared_citers":6},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":6},{"title":"Kimi Linear: An Expressive, Efficient Attention Architecture","work_id":"b2f9f1cd-c39c-4dbc-8637-f575681cdc01","shared_citers":6},{"title":"Learning to (learn at test time): Rnns with expressive hidden states","work_id":"c682430c-e7a2-4699-b82d-55287448dbba","shared_citers":6},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":6},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":5},{"title":"Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free","work_id":"35cc586b-44f1-4948-a84b-866e8335e649","shared_citers":5},{"title":"Griffin: Mixing gated linear recurrences with local attention for efficient language models","work_id":"546b02db-26f0-4dac-b50e-912e7f4e181c","shared_citers":5},{"title":"Hungry hungry hippos: Towards language modeling with state space models","work_id":"d5653b0c-f12c-4141-9343-d65df1fb4214","shared_citers":5},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":5},{"title":"Reformer: The Efficient Transformer","work_id":"eb3dbae4-931f-40ab-a37b-507a35f42712","shared_citers":5},{"title":"Rethinking Attention with Performers","work_id":"4c26d308-8b72-4a98-8e73-950617a75f50","shared_citers":5},{"title":"RoFormer: Enhanced Transformer with Rotary Position Embedding","work_id":"4e5eee26-cd04-4c7a-988f-3e6d1a1f0eb9","shared_citers":5}],"time_series":[{"n":2,"year":2025},{"n":46,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T15:11:52.046006+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T15:01:38.812040+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality","claims":[{"claim_text":"While Transformers have been the main architecture behind deep learning's success in language modeling, state-space models (SSMs) such as Mamba have recently been shown to match or outperform Transformers at small to medium scale. 
"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality","claims":[{"claim_text":"While Transformers have been the main architecture behind deep learning's success in language modeling, state-space models (SSMs) such as Mamba have recently been shown to match or outperform Transformers at small to medium scale. We show that these families of models are actually quite closely related, and develop a rich framework of theoretical connections between SSMs and variants of attention, connected through various decompositions of a well-studied class of structured semiseparable matrices. Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is a refinement of Mamba's selective SSM that is 2-8X faster, while continuing to be competitive with Transformers on language modeling.","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T15:11:52.090325+00:00"}},
"summary":{"title":"Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality","claims":[{"claim_text":"While Transformers have been the main architecture behind deep learning's success in language modeling, state-space models (SSMs) such as Mamba have recently been shown to match or outperform Transformers at small to medium scale. We show that these families of models are actually quite closely related, and develop a rich framework of theoretical connections between SSMs and variants of attention, connected through various decompositions of a well-studied class of structured semiseparable matrices. Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is a refinement of Mamba's selective SSM that is 2-8X faster, while continuing to be competitive with Transformers on language modeling.","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality because it crossed a citation-hub threshold.","role_counts":[]},
"graph":{"co_cited":[{"title":"Mamba: Linear-Time Sequence Modeling with Selective State Spaces","work_id":"4ee75248-1199-492c-a52f-6661e0f4adff","shared_citers":25},{"title":"Efficiently Modeling Long Sequences with Structured State Spaces","work_id":"4150b761-b8bf-4d9b-a2f8-cb2d1b73d378","shared_citers":14},{"title":"Retentive Network: A Successor to Transformer for Large Language Models","work_id":"5b0449ac-92b0-41f2-8b4f-586c2b5a08b6","shared_citers":11},{"title":"Gated Delta Networks: Improving Mamba2 with Delta Rule","work_id":"884939d3-e283-4625-bff4-b7e0e4cc2a6e","shared_citers":10},{"title":"Jamba: A Hybrid Transformer-Mamba Language Model","work_id":"129df0fe-8a66-4077-8991-3557cfa38274","shared_citers":10},{"title":"Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":9},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":8},{"title":"Longformer: The Long-Document Transformer","work_id":"abea7a44-6668-4de7-aab6-f53a6e5aa088","shared_citers":8},{"title":"RWKV: Reinventing RNNs for the Transformer Era","work_id":"524dc80d-f4ef-4f89-bf1a-9a8c1e4b6a81","shared_citers":8},{"title":"Gated linear attention transformers with hardware-efficient training","work_id":"65a18a30-6e80-4b64-a026-bb0368e38872","shared_citers":7},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":7},{"title":"Attention Is All You Need","work_id":"baafb5a2-5272-43bc-932f-09fa9ffe5316","shared_citers":6},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":6},{"title":"Kimi Linear: An Expressive, Efficient Attention Architecture","work_id":"b2f9f1cd-c39c-4dbc-8637-f575681cdc01","shared_citers":6},{"title":"Learning to (learn at test time): Rnns with expressive hidden states","work_id":"c682430c-e7a2-4699-b82d-55287448dbba","shared_citers":6},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":6},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":5},{"title":"Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free","work_id":"35cc586b-44f1-4948-a84b-866e8335e649","shared_citers":5},{"title":"Griffin: Mixing gated linear recurrences with local attention for efficient language models","work_id":"546b02db-26f0-4dac-b50e-912e7f4e181c","shared_citers":5},{"title":"Hungry hungry hippos: Towards language modeling with state space models","work_id":"d5653b0c-f12c-4141-9343-d65df1fb4214","shared_citers":5},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":5},{"title":"Reformer: The Efficient Transformer","work_id":"eb3dbae4-931f-40ab-a37b-507a35f42712","shared_citers":5},{"title":"Rethinking Attention with Performers","work_id":"4c26d308-8b72-4a98-8e73-950617a75f50","shared_citers":5},{"title":"RoFormer: Enhanced Transformer with Rotary Position Embedding","work_id":"4e5eee26-cd04-4c7a-988f-3e6d1a1f0eb9","shared_citers":5}],"time_series":[{"n":2,"year":2025},{"n":46,"year":2026}],"dependency_candidates":[]},
"authors":[]}}
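Note on the abstract quoted in this record: the duality it describes is that a selective SSM can be computed either as a linear-time recurrence or as multiplication by a semiseparable matrix, which has the same shape as a masked attention product. The sketch below is a minimal NumPy illustration of that equivalence for the scalar-decay case; it was written for this record and is not the authors' implementation, and the sizes and variable names (T, N, a, B, C, x) are illustrative assumptions.

import numpy as np

# Toy sizes: sequence length T, state dimension N (illustrative, not from the paper).
rng = np.random.default_rng(0)
T, N = 6, 4
a = rng.uniform(0.5, 1.0, size=T)   # per-step scalar decay (the "selective" part)
B = rng.standard_normal((T, N))     # input projections B_t
C = rng.standard_normal((T, N))     # output projections C_t
x = rng.standard_normal(T)          # one scalar input channel

# Recurrent (SSM) view, linear in T: h_t = a_t * h_{t-1} + B_t * x_t, y_t = C_t . h_t
h = np.zeros(N)
y_recurrent = np.zeros(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_recurrent[t] = C[t] @ h

# Attention-like (quadratic) view: y = (L * (C @ B.T)) @ x, where L is the
# 1-semiseparable causal mask with L[t, s] = a_{s+1} * ... * a_t for s <= t.
L = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        L[t, s] = np.prod(a[s + 1 : t + 1])  # empty product (s == t) is 1
y_quadratic = (L * (C @ B.T)) @ x

# Both computations realize the same linear map, which is the duality in question.
assert np.allclose(y_recurrent, y_quadratic)

With a[t] = 1 for every step, L reduces to the all-ones causal mask and the quadratic view becomes unnormalized linear attention, the special case that ties SSMs to softmax-free Transformers; per the abstract, Mamba-2's core layer exploits this equivalence to run 2-8X faster than Mamba's selective SSM.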