pith. machine review for the scientific record.

arxiv: 2405.21060 · v1 · submitted 2024-05-31 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Albert Gu, Tri Dao

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 12:10 UTC · model grok-4.3

classification 💻 cs.LG
keywords state space models · transformers · mamba · semiseparable matrices · attention · efficient algorithms · language modeling · duality

The pith

Transformers and state-space models share a common structure through decompositions of semiseparable matrices, allowing a faster Mamba-2 model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that Transformers and SSMs such as Mamba are linked by viewing their computations as different decompositions of the same structured semiseparable matrices. This state space duality framework reveals that attention is a special case of an SSM and supplies new algorithms for both. The authors apply the connection to create Mamba-2, a refined selective SSM whose core layer runs 2-8 times faster than the original Mamba while staying competitive with Transformers on language modeling. A sympathetic reader would care because the result collapses the apparent opposition between recurrent and attention-based architectures into interchangeable views of one matrix family. If the claim holds, model designers gain a single set of tools for trading off speed, memory, and expressivity without starting from scratch.
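To make the shared structure concrete, here is a minimal rendering of the selective SSM recurrence and its sequence-level matrix form, in conventional SSM notation rather than the paper's exact symbols:

```latex
\[
h_t = A_t h_{t-1} + B_t x_t, \qquad y_t = C_t^{\top} h_t
\quad\Longrightarrow\quad
y = M x, \qquad
M_{ji} =
\begin{cases}
C_j^{\top}\!\bigl(\prod_{k=i+1}^{j} A_k\bigr)\, B_i, & j \ge i,\\
0, & j < i.
\end{cases}
\]
```

M is a lower-triangular semiseparable matrix. In the scalar-decay case the paper adopts for the SSD layer (A_t = a_t I), M factors as the elementwise product of a 1-semiseparable mask L, with L_{ji} = a_{i+1} \cdots a_j, and the Gram-like matrix C B^{\top}; setting every a_k = 1 turns L into the plain causal mask and recovers causal linear attention, which is the sense in which attention appears as a special case.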

Core claim

Transformers and SSMs are closely related, connected through various decompositions of structured semiseparable matrices. The state space duality framework lets us design Mamba-2, whose core layer refines Mamba's selective SSM to be 2-8X faster while remaining competitive with Transformers on language modeling.

What carries the argument

State space duality (SSD) framework, which equates variants of attention and selective SSMs through structured semiseparable matrix decompositions.
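A hedged numerical sketch of the equivalence the SSD framework trades on, under the scalar-decay form used for Mamba-2's core layer; the variable names and the single input channel are illustrative choices, not the paper's code. The same output is computed once as a sequential recurrence (the SSM view) and once as a single masked matrix multiply (the attention view):

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 6, 4                      # sequence length, state dimension
x = rng.standard_normal(T)       # one input channel, for clarity
a = rng.uniform(0.5, 1.0, T)     # input-dependent scalar decays (A_t = a_t * I)
B = rng.standard_normal((T, N))  # input-dependent input projections
C = rng.standard_normal((T, N))  # input-dependent output projections

# SSM view: sequential recurrence h_t = a_t h_{t-1} + B_t x_t, y_t = C_t . h_t
h = np.zeros(N)
y_recurrent = np.empty(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_recurrent[t] = C[t] @ h

# Attention view: one masked matrix multiply y = (L * (C B^T)) x,
# where L[j, i] = a_{i+1} * ... * a_j for j >= i and 0 otherwise (1-semiseparable mask).
L = np.zeros((T, T))
for j in range(T):
    for i in range(j + 1):
        L[j, i] = np.prod(a[i + 1 : j + 1])
M = L * (C @ B.T)                # the structured semiseparable matrix
y_matrix = M @ x

print(np.allclose(y_recurrent, y_matrix))  # True: the two views agree exactly
```

The masked multiply is quadratic in sequence length while the recurrence is linear with a constant-size state; the duality is what lets a designer pick between those two costs for the same layer.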

If this is right

  • Mamba-2's core SSD layer runs 2-8X faster than Mamba's original selective SSM scan.
  • Mamba-2 maintains competitive performance with Transformers on language modeling tasks.
  • The duality supplies efficient algorithms for both SSMs and attention variants (a toy chunked sketch follows this list).
  • New architectures can be built by choosing different decompositions within the same semiseparable matrix family.
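As referenced in the list above, here is a heavily hedged toy of how the duality yields an efficient algorithm, in the same scalar-decay setup as the previous listing: quadratic attention-style work inside fixed-size chunks, with a single recurrent state carried across chunk boundaries. This mirrors the block-decomposition idea only at a high level; the chunk size, names, and single-channel simplification are illustrative assumptions, not the paper's SSD kernel.

```python
import numpy as np

def ssd_chunked(x, a, B, C, Q=4):
    """Toy chunked evaluation of h_t = a_t h_{t-1} + B_t x_t, y_t = C_t . h_t.

    Quadratic (attention-like) work inside each length-Q chunk, plus a single
    recurrent state carried across chunks; an illustrative sketch, not the
    paper's SSD kernel."""
    T, N = B.shape
    y = np.empty(T)
    h = np.zeros(N)                           # state entering the current chunk
    for s in range(0, T, Q):
        e = min(s + Q, T)
        aq, Bq, Cq, xq = a[s:e], B[s:e], C[s:e], x[s:e]
        q = e - s
        decay_to = np.cumprod(aq)             # decay_to[j] = a_s * ... * a_{s+j}
        # intra-chunk part: masked quadratic form, exactly as in the duality
        Lq = np.zeros((q, q))
        for j in range(q):
            for i in range(j + 1):
                Lq[j, i] = np.prod(aq[i + 1 : j + 1])
        y[s:e] = (Lq * (Cq @ Bq.T)) @ xq
        # inter-chunk part: contribution of the state carried in from the left
        y[s:e] += (Cq @ h) * decay_to
        # update the carried state for the next chunk
        decay_from = decay_to[-1] / decay_to  # a_{s+j+1} * ... * a_{e-1}
        h = decay_to[-1] * h + (decay_from[:, None] * Bq * xq[:, None]).sum(axis=0)
    return y

# check against the plain recurrence
rng = np.random.default_rng(1)
T, N = 10, 3
x = rng.standard_normal(T)
a = rng.uniform(0.5, 1.0, T)
B = rng.standard_normal((T, N))
C = rng.standard_normal((T, N))
h = np.zeros(N); y_ref = np.empty(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_ref[t] = C[t] @ h
print(np.allclose(ssd_chunked(x, a, B, C), y_ref))  # True
```

With chunk length Q the intra-chunk masks hold O(TQ) entries in total instead of O(T^2) for the full masked multiply, while the carried state keeps the cross-chunk cost linear in sequence length.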

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The shared matrix view suggests hardware kernels written for one architecture can be reused for the other with only a change of decomposition.
  • Results from linear algebra on semiseparable matrices, such as fast inversion or low-rank updates, could be ported directly to improve long-sequence scaling in either model family.
  • Hybrid layers that switch between attention-style and SSM-style decompositions within a single network become a natural design option rather than an ad-hoc combination.

Load-bearing premise

The decompositions of structured semiseparable matrices preserve the modeling capacity and training dynamics of the original selective SSM.

What would settle it

Running Mamba-2 on standard language modeling benchmarks and finding either no measurable speedup over Mamba or a clear rise in perplexity relative to both Mamba and Transformers would show the decompositions fail to deliver the claimed benefits in practice.

read the original abstract

While Transformers have been the main architecture behind deep learning's success in language modeling, state-space models (SSMs) such as Mamba have recently been shown to match or outperform Transformers at small to medium scale. We show that these families of models are actually quite closely related, and develop a rich framework of theoretical connections between SSMs and variants of attention, connected through various decompositions of a well-studied class of structured semiseparable matrices. Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is a refinement of Mamba's selective SSM that is 2-8X faster, while continuing to be competitive with Transformers on language modeling.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper establishes a theoretical framework called State Space Duality (SSD) that unifies Transformers and state space models (SSMs) like Mamba by showing connections through decompositions of structured semiseparable matrices. It proposes Mamba-2, a new architecture based on a refined selective SSM that achieves significant speedups (2-8X) over previous models while remaining competitive with Transformers on language modeling benchmarks.

Significance. If the SSD framework provides exact equivalences and the proposed algorithms deliver the claimed efficiency gains without sacrificing modeling capacity, this work has the potential to advance the field by offering a unified view of attention and SSMs, leading to more efficient and scalable sequence models. The development of generalized models and efficient algorithms is a strength, particularly if supported by rigorous derivations.

major comments (2)
  1. [§3] SSD framework and matrix decompositions: The claim that SSD yields an equivalent selective SSM layer must be shown to hold exactly for input-dependent A/B/C matrices. The manuscript should provide a formal derivation or proof that the structured semiseparable factorization introduces no hidden low-rank or block-diagonal approximations, as any such assumption would risk altering long-range, input-dependent recall dynamics.
  2. [§5] Mamba-2 architecture and experiments: To substantiate that modeling capacity is preserved, include direct comparisons of Mamba-2 against the original selective SSM on tasks emphasizing long-context input-dependent memory, with ablations isolating the SSD algorithm from implementation optimizations to confirm the reported 2-8X speedups.
minor comments (2)
  1. [Abstract] The phrasing 'a refinement of Mamba's selective SSM' is vague; specify the precise modifications to the core layer.
  2. [Notation] Notation and figures: Ensure uniform notation for semiseparable matrices and improve clarity of any matrix decomposition diagrams.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and recognition of the potential impact of the SSD framework and Mamba-2. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3] SSD framework and matrix decompositions: The claim that SSD yields an equivalent selective SSM layer must be shown to hold exactly for input-dependent A/B/C matrices. The manuscript should provide a formal derivation or proof that the structured semiseparable factorization introduces no hidden low-rank or block-diagonal approximations, as any such assumption would risk altering long-range, input-dependent recall dynamics.

    Authors: We thank the referee for this important clarification request. Section 3 derives the SSD framework by expressing the selective SSM recurrence as a structured semiseparable matrix and showing its duality to an attention-like form. The construction incorporates input-dependent A, B, and C directly into the diagonal blocks and low-rank factors of the semiseparable decomposition, preserving the exact recurrence without additional low-rank or block-diagonal approximations. To address the concern rigorously, we will add a formal proof in the appendix of the revised manuscript that verifies the equivalence holds exactly for arbitrary input-dependent parameters, with explicit steps showing that no hidden assumptions are introduced that would alter long-range dynamics. revision: yes

  2. Referee: [§5] Mamba-2 architecture and experiments: To substantiate that modeling capacity is preserved, include direct comparisons of Mamba-2 against the original selective SSM on tasks emphasizing long-context input-dependent memory, with ablations isolating the SSD algorithm from implementation optimizations to confirm the reported 2-8X speedups.

    Authors: We agree that targeted experiments would further substantiate the preservation of modeling capacity. The current manuscript shows Mamba-2 remains competitive with Transformers on language modeling while delivering the reported speedups over prior SSMs. In the revision, we will add direct comparisons of Mamba-2 against the original selective SSM (Mamba) on long-context tasks focused on input-dependent memory, such as associative recall and long-range dependency benchmarks. We will also include ablations that isolate the SSD algorithmic improvements from low-level implementation optimizations, thereby confirming that the 2-8X speedups arise from the structured duality rather than engineering alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation rests on independent matrix decompositions

full rationale

The paper establishes connections between SSMs and attention variants by decomposing structured semiseparable matrices, then uses this SSD framework to refine the selective SSM into Mamba-2 with claimed speedups. No step equates a prediction to its fitted input, renames a known result as novel unification, or reduces the central claim to a self-citation chain. Prior Mamba work is cited for context on selectivity, but the duality, decompositions, and algorithm derivations are presented as self-contained linear-algebra results with explicit matrix constructions that do not presuppose the target architecture or performance claims. The framework remains falsifiable via direct implementation and benchmarking against the original recurrence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the framework rests on the mathematical properties of structured semiseparable matrices, which are treated as standard background.

pith-pipeline@v0.9.0 · 5413 in / 952 out tokens · 30947 ms · 2026-05-11T12:10:34.468343+00:00 · methodology


Forward citations

Cited by 50 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Selection, Not Fusion: Radar-Modulated State Space Models for Radar-Camera Depth Estimation

    cs.CV 2026-05 unverdicted novelty 7.0

    Radar-Modulated Selection perturbs only the step size Δ and readout C parameters inside Mamba's selective scan with radar data while keeping other components image-only, yielding state-of-the-art depth estimation on n...

  2. TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles

    cs.CV 2026-05 unverdicted novelty 7.0

    TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.

  3. TIDES: Implicit Time-Awareness in Selective State Space Models

    cs.LG 2026-05 unverdicted novelty 7.0

    TIDES reconciles selective SSM expressivity with continuous-time physical discretization by moving input dependence onto the state matrix, enabling native irregular time series handling and achieving SOTA on UEA and P...

  4. FRACTAL: SSM with Fractional Recurrent Architecture for Computational Temporal Analysis of Long Sequences

    cs.AI 2026-05 unverdicted novelty 7.0

    FRACTAL integrates fractional recurrent architecture into SSMs using a tunable singularity index to capture multi-scale temporal features, reporting 87.11% average on Long Range Arena and outperforming S5.

  5. Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control

    cs.LG 2026-05 unverdicted novelty 7.0

    Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.

  6. PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

    cs.LG 2026-05 unverdicted novelty 7.0

    PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...

  7. Rethink MAE with Linear Time-Invariant Dynamics

    cs.CV 2026-04 unverdicted novelty 7.0

    Token order in frozen visual representations is exploitable via SSM-based LTI probes, revealing pre-training-dependent heterogeneity that fixed pooling misses.

  8. Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

    cs.LG 2026-04 unverdicted novelty 7.0

    Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.

  9. The UNDO Flip-Flop: A Controlled Probe for Reversible Semantic State Management in State Space Model

    cs.LG 2026-04 unverdicted novelty 7.0

    Mamba-2 models fail to learn reversible state retrieval in the UNDO Flip-Flop task, defaulting to a toggle heuristic and achieving only 41% accuracy under adversarial conditions.

  10. S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models

    cs.CL 2026-04 conditional novelty 7.0

    S0 tuning optimizes initial recurrent states in hybrid models to outperform LoRA with zero inference cost on HumanEval and partial cross-domain transfer.

  11. The Randomness Floor: Measuring Intrinsic Non-Randomness in Language Model Token Distributions

    cs.CL 2026-03 unverdicted novelty 7.0

    Language models have an intrinsic randomness floor: transformers show ~0.30 entropic deviation from uniform on neutral prompts, accounting for 88-93% of observed non-randomness, while state-space models exhibit twice ...

  12. Freeze Deep, Train Shallow: Interpretable Layer Allocation for Continued Pre-Training

    cs.CL 2026-05 unverdicted novelty 6.0

    Freezing deep layers and training shallow layers during continued pre-training of LLMs outperforms full fine-tuning and the opposite allocation on C-Eval and CMMLU, guided by a new layer-sensitivity diagnostic.

  13. MambaNetBurst: Direct Byte-level Network Traffic Classification without Tokenization or Pretraining

    cs.CR 2026-05 unverdicted novelty 6.0

    A compact Mamba-2 model performs end-to-end byte-level network traffic classification without tokenization or pre-training and remains competitive with substantially larger pre-trained systems.

  14. RT-Transformer: The Transformer Block as a Spherical State Estimator

    cs.LG 2026-05 unverdicted novelty 6.0

    Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.

  15. Structured Recurrent Mixers for Massively Parallelized Sequence Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.

  16. Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators

    cs.LG 2026-05 unverdicted novelty 6.0

    Spectral Koopman operators let SSMs achieve 100% accuracy on long-gap multi-query associative recall with fixed memory, where pure Mamba fails.

  17. Cubit: Token Mixer with Kernel Ridge Regression

    cs.LG 2026-05 unverdicted novelty 6.0

    Cubit replaces Transformer attention with Kernel Ridge Regression token mixing and shows potential gains on longer sequences.

  18. The Impossibility Triangle of Long-Context Modeling

    cs.CL 2026-05 unverdicted novelty 6.0

    No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.

  19. Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

    cs.CL 2026-04 unverdicted novelty 6.0

    HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.

  20. NAKUL-Med: Spectral-Graph State Space Models with Dynamics Kernels for Medical Signals

    eess.SP 2026-04 unverdicted novelty 6.0

    NAKUL achieves 91.7% accuracy on motor imagery EEG with 28% fewer parameters than EEG-Conformer by using dynamic kernel generation, spectral context modeling, and graph-guided spatial attention.

  21. HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models

    cs.LG 2026-04 unverdicted novelty 6.0

    HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.

  22. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  23. M$^{2}$GRPO: Mamba-based Multi-Agent Group Relative Policy Optimization for Biomimetic Underwater Robots Pursuit

    cs.RO 2026-04 unverdicted novelty 6.0

    M²GRPO uses a Mamba-based policy and normalized group-relative advantages under CTDE to achieve higher pursuit success and capture efficiency than MAPPO and recurrent baselines in simulations and pool tests.

  24. Sonata: A Hybrid World Model for Inertial Kinematics under Clinical Data Scarcity

    cs.LG 2026-04 unverdicted novelty 6.0

    Sonata is a small hybrid world model pre-trained to predict future IMU states that outperforms autoregressive baselines on clinical discrimination, fall-risk prediction, and cross-cohort transfer while fitting on-devi...

  25. MambaBack: Bridging Local Features and Global Contexts in Whole Slide Image Analysis

    cs.CV 2026-04 conditional novelty 6.0

    MambaBack is a hybrid Mamba-CNN model with Hilbert sampling and chunked inference that reports better performance than seven prior methods on five whole-slide image datasets.

  26. Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective

    cs.CV 2026-04 unverdicted novelty 6.0

    The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...

  27. Parcae: Scaling Laws For Stable Looped Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...

  28. Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.

  29. Optimal Decay Spectra for Linear Recurrences

    cs.LG 2026-04 unverdicted novelty 6.0

    PoST reparameterizes decay spectra in linear recurrences with geometric log-spacing and position-adaptive scaling to achieve O(exp(-cN/log t)) decay, improving zero-shot language modeling and long-context retrieval ac...

  30. In-Place Test-Time Training

    cs.LG 2026-04 conditional novelty 6.0

    In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

  31. Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space

    cs.CL 2026-04 unverdicted novelty 6.0

    PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.

  32. Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

    cs.LG 2026-04 unverdicted novelty 6.0

    Stochastic training with random cross-layer KV attention enables depth-wise cache sharing in transformers, cutting memory footprint while preserving or improving performance.

  33. Attention to Mamba: A Recipe for Cross-Architecture Distillation

    cs.CL 2026-04 unverdicted novelty 6.0

    A two-stage distillation recipe converts a Pythia-1B Transformer into a Mamba model that preserves performance with perplexity 14.11 versus the teacher's 13.86.

  34. Computer Architecture's AlphaZero Moment: Automated Discovery in an Encircled World

    cs.AR 2026-03 conditional novelty 6.0

    Automated architectural discovery engines can outperform human design teams by exploring massive design spaces and compressing development cycles from months to weeks.

  35. Kimi Linear: An Expressive, Efficient Attention Architecture

    cs.CL 2025-10 unverdicted novelty 6.0

    Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

  36. MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    cs.CL 2025-06 unverdicted novelty 6.0

    MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...

  37. Titans: Learning to Memorize at Test Time

    cs.LG 2024-12 unverdicted novelty 6.0

    Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.

  38. Mela: Test-Time Memory Consolidation based on Transformation Hypothesis

    cs.CL 2026-05 unverdicted novelty 5.0

    Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.

  39. Kaczmarz Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...

  40. Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving

    cs.DC 2026-05 unverdicted novelty 5.0

    Irminsul recovers up to 83% of prompt tokens above exact-prefix matching and delivers 63% prefill energy savings per cache hit on MLA-MoE models by content-hashing CDC chunks and applying closed-form kr correction.

  41. Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models

    cs.LG 2026-04 unverdicted novelty 5.0

    Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.

  42. Reasoning Primitives in Hybrid and Non-Hybrid LLMs

    cs.CL 2026-04 unverdicted novelty 5.0

    Reasoning augmentation extends the difficulty range for both architectures, but hybrid models stay robust longer than transformers as sequential dependence increases in state-based recall tasks.

  43. LayerTracer: A Joint Task-Particle and Vulnerable-Layer Analysis framework for Arbitrary Large Language Model Architectures

    cs.CL 2026-04 unverdicted novelty 5.0

    LayerTracer defines task particles as the first layer where target token probability rises sharply and vulnerable layers via maximum JS divergence after masking, showing task particles in deep layers and greater robus...

  44. FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control

    cs.LG 2026-04 unverdicted novelty 5.0

    FG²-GDN replaces the scalar beta in the delta update with a channel-wise vector and decouples key/value scaling to improve recall over prior GDN and KDA models.

  45. Sessa: Selective State Space Attention

    cs.LG 2026-04 unverdicted novelty 5.0

    Sessa integrates attention within recurrent paths to achieve power-law memory tails and flexible non-decaying selective retrieval, outperforming baselines on long-context tasks.

  46. Hypergraph-State Collaborative Reasoning for Multi-Object Tracking

    cs.CV 2026-04 unverdicted novelty 5.0

    HyperSSM integrates hypergraphs and state space models to let correlated objects mutually refine motion estimates, stabilizing trajectories under noise and occlusion for state-of-the-art multi-object tracking.

  47. COREY: Entropy-Guided Runtime Chunk Scheduling for Selective Scan Kernels

    cs.CV 2026-04 unverdicted novelty 5.0

    COREY maps activation entropy to chunk sizes for SSM kernels, matching static-oracle latency at kernel level with 3.9-4.4x speedups over baselines but adding overhead that prevents end-to-end gains while preserving ex...

  48. CARE-ECG: Causal Agent-based Reasoning for Explainable and Counterfactual ECG Interpretation

    cs.LG 2026-04 unverdicted novelty 5.0

    CARE-ECG unifies ECG representation learning, causal graph-based diagnosis, and counterfactual assessment in an agentic LLM pipeline to improve accuracy and explanation faithfulness.

  49. Efficient Spatial-Temporal Focal Adapter with SSM for Temporal Action Detection

    cs.CV 2026-04 unverdicted novelty 5.0

    A new adapter module combining boundary-aware state space modeling with spatial processing boosts localization and robustness in temporal action detection.

  50. The Hyperscale Lottery: How State-Space Models Have Sacrificed Edge Efficiency

    cs.AR 2026-04 unverdicted novelty 4.0

    Mamba-3 architectural changes optimized for hyperscale GPUs cause 28% higher edge latency at 880M parameters and 48% at 15M parameters compared to earlier versions.
