pith. machine review for the scientific record.

arxiv: 2405.21060 · v1 · submitted 2024-05-31 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Albert Gu, Tri Dao

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 12:10 UTC · model grok-4.3

classification 💻 cs.LG
keywords state space models · transformers · mamba · semiseparable matrices · attention · efficient algorithms · language modeling · duality

The pith

Transformers and state-space models share a common structure through decompositions of semiseparable matrices, allowing a faster Mamba-2 model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that Transformers and SSMs such as Mamba are linked by viewing their computations as different decompositions of the same structured semiseparable matrices. This state space duality framework reveals that attention is a special case of an SSM and supplies new algorithms for both. The authors apply the connection to create Mamba-2, a refined selective SSM whose core layer runs 2-8 times faster than the original Mamba while staying competitive with Transformers on language modeling. A sympathetic reader would care because the result collapses the apparent opposition between recurrent and attention-based architectures into interchangeable views of one matrix family. If the claim holds, model designers gain a single set of tools for trading off speed, memory, and expressivity without starting from scratch.
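To make the shared structure concrete, here is a minimal rendering of the selective SSM recurrence and its sequence-level matrix form, in conventional SSM notation rather than the paper's exact symbols:

```latex
\[
h_t = A_t h_{t-1} + B_t x_t, \qquad y_t = C_t^{\top} h_t
\quad\Longrightarrow\quad
y = M x, \qquad
M_{ji} =
\begin{cases}
C_j^{\top}\!\bigl(\prod_{k=i+1}^{j} A_k\bigr)\, B_i, & j \ge i,\\
0, & j < i.
\end{cases}
\]
```

M is a lower-triangular semiseparable matrix. In the scalar-decay case the paper adopts for the SSD layer (A_t = a_t I), M factors as the elementwise product of a 1-semiseparable mask L, with L_{ji} = a_{i+1} \cdots a_j, and the Gram-like matrix C B^{\top}; setting every a_k = 1 turns L into the plain causal mask and recovers causal linear attention, which is the sense in which attention appears as a special case.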

Core claim

Transformers and SSMs are closely related, connected through various decompositions of structured semiseparable matrices. The state space duality framework lets us design Mamba-2, whose core layer refines Mamba's selective SSM to be 2-8X faster while remaining competitive with Transformers on language modeling.

What carries the argument

State space duality (SSD) framework, which equates variants of attention and selective SSMs through structured semiseparable matrix decompositions.
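A hedged numerical sketch of the equivalence the SSD framework trades on, under the scalar-decay form used for Mamba-2's core layer; the variable names and the single input channel are illustrative choices, not the paper's code. The same output is computed once as a sequential recurrence (the SSM view) and once as a single masked matrix multiply (the attention view):

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 6, 4                      # sequence length, state dimension
x = rng.standard_normal(T)       # one input channel, for clarity
a = rng.uniform(0.5, 1.0, T)     # input-dependent scalar decays (A_t = a_t * I)
B = rng.standard_normal((T, N))  # input-dependent input projections
C = rng.standard_normal((T, N))  # input-dependent output projections

# SSM view: sequential recurrence h_t = a_t h_{t-1} + B_t x_t, y_t = C_t . h_t
h = np.zeros(N)
y_recurrent = np.empty(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_recurrent[t] = C[t] @ h

# Attention view: one masked matrix multiply y = (L * (C B^T)) x,
# where L[j, i] = a_{i+1} * ... * a_j for j >= i and 0 otherwise (1-semiseparable mask).
L = np.zeros((T, T))
for j in range(T):
    for i in range(j + 1):
        L[j, i] = np.prod(a[i + 1 : j + 1])
M = L * (C @ B.T)                # the structured semiseparable matrix
y_matrix = M @ x

print(np.allclose(y_recurrent, y_matrix))  # True: the two views agree exactly
```

The masked multiply is quadratic in sequence length while the recurrence is linear with a constant-size state; the duality is what lets a designer pick between those two costs for the same layer.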

If this is right

  • Mamba-2's core SSD layer runs 2-8X faster than Mamba's original selective SSM scan.
  • Mamba-2 maintains competitive performance with Transformers on language modeling tasks.
  • The duality supplies efficient algorithms for both SSMs and attention variants (a toy chunked sketch follows this list).
  • New architectures can be built by choosing different decompositions within the same semiseparable matrix family.
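As referenced in the list above, here is a heavily hedged toy of how the duality yields an efficient algorithm, in the same scalar-decay setup as the previous listing: quadratic attention-style work inside fixed-size chunks, with a single recurrent state carried across chunk boundaries. This mirrors the block-decomposition idea only at a high level; the chunk size, names, and single-channel simplification are illustrative assumptions, not the paper's SSD kernel.

```python
import numpy as np

def ssd_chunked(x, a, B, C, Q=4):
    """Toy chunked evaluation of h_t = a_t h_{t-1} + B_t x_t, y_t = C_t . h_t.

    Quadratic (attention-like) work inside each length-Q chunk, plus a single
    recurrent state carried across chunks; an illustrative sketch, not the
    paper's SSD kernel."""
    T, N = B.shape
    y = np.empty(T)
    h = np.zeros(N)                           # state entering the current chunk
    for s in range(0, T, Q):
        e = min(s + Q, T)
        aq, Bq, Cq, xq = a[s:e], B[s:e], C[s:e], x[s:e]
        q = e - s
        decay_to = np.cumprod(aq)             # decay_to[j] = a_s * ... * a_{s+j}
        # intra-chunk part: masked quadratic form, exactly as in the duality
        Lq = np.zeros((q, q))
        for j in range(q):
            for i in range(j + 1):
                Lq[j, i] = np.prod(aq[i + 1 : j + 1])
        y[s:e] = (Lq * (Cq @ Bq.T)) @ xq
        # inter-chunk part: contribution of the state carried in from the left
        y[s:e] += (Cq @ h) * decay_to
        # update the carried state for the next chunk
        decay_from = decay_to[-1] / decay_to  # a_{s+j+1} * ... * a_{e-1}
        h = decay_to[-1] * h + (decay_from[:, None] * Bq * xq[:, None]).sum(axis=0)
    return y

# check against the plain recurrence
rng = np.random.default_rng(1)
T, N = 10, 3
x = rng.standard_normal(T)
a = rng.uniform(0.5, 1.0, T)
B = rng.standard_normal((T, N))
C = rng.standard_normal((T, N))
h = np.zeros(N); y_ref = np.empty(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_ref[t] = C[t] @ h
print(np.allclose(ssd_chunked(x, a, B, C), y_ref))  # True
```

With chunk length Q the intra-chunk masks hold O(TQ) entries in total instead of O(T^2) for the full masked multiply, while the carried state keeps the cross-chunk cost linear in sequence length.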

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The shared matrix view suggests hardware kernels written for one architecture can be reused for the other with only a change of decomposition.
  • Results from linear algebra on semiseparable matrices, such as fast inversion or low-rank updates, could be ported directly to improve long-sequence scaling in either model family.
  • Hybrid layers that switch between attention-style and SSM-style decompositions within a single network become a natural design option rather than an ad-hoc combination.

Load-bearing premise

The decompositions of structured semiseparable matrices preserve the modeling capacity and training dynamics of the original selective SSM.

What would settle it

Running Mamba-2 on standard language modeling benchmarks and finding either no measurable speedup over Mamba or a clear rise in perplexity relative to both Mamba and Transformers would show the decompositions fail to deliver the claimed benefits in practice.

read the original abstract

While Transformers have been the main architecture behind deep learning's success in language modeling, state-space models (SSMs) such as Mamba have recently been shown to match or outperform Transformers at small to medium scale. We show that these families of models are actually quite closely related, and develop a rich framework of theoretical connections between SSMs and variants of attention, connected through various decompositions of a well-studied class of structured semiseparable matrices. Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is a refinement of Mamba's selective SSM that is 2-8X faster, while continuing to be competitive with Transformers on language modeling.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper establishes a theoretical framework called State Space Duality (SSD) that unifies Transformers and state space models (SSMs) like Mamba by showing connections through decompositions of structured semiseparable matrices. It proposes Mamba-2, a new architecture based on a refined selective SSM that achieves significant speedups (2-8X) over previous models while remaining competitive with Transformers on language modeling benchmarks.

Significance. If the SSD framework provides exact equivalences and the proposed algorithms deliver the claimed efficiency gains without sacrificing modeling capacity, this work has the potential to advance the field by offering a unified view of attention and SSMs, leading to more efficient and scalable sequence models. The development of generalized models and efficient algorithms is a strength, particularly if supported by rigorous derivations.

major comments (2)
  1. [§3] SSD framework and matrix decompositions: The claim that SSD yields an equivalent selective SSM layer must be shown to hold exactly for input-dependent A/B/C matrices. The manuscript should provide a formal derivation or proof that the structured semiseparable factorization introduces no hidden low-rank or block-diagonal approximations, as any such assumption would risk altering long-range, input-dependent recall dynamics.
  2. [§5] Mamba-2 architecture and experiments: To substantiate that modeling capacity is preserved, include direct comparisons of Mamba-2 against the original selective SSM on tasks emphasizing long-context input-dependent memory, with ablations isolating the SSD algorithm from implementation optimizations to confirm the reported 2-8X speedups.
minor comments (2)
  1. [Abstract] The phrasing 'a refinement of Mamba's selective SSM' is vague; specify the precise modifications to the core layer.
  2. [Notation] Notation and figures: Ensure uniform notation for semiseparable matrices and improve clarity of any matrix decomposition diagrams.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and recognition of the potential impact of the SSD framework and Mamba-2. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3] SSD framework and matrix decompositions: The claim that SSD yields an equivalent selective SSM layer must be shown to hold exactly for input-dependent A/B/C matrices. The manuscript should provide a formal derivation or proof that the structured semiseparable factorization introduces no hidden low-rank or block-diagonal approximations, as any such assumption would risk altering long-range, input-dependent recall dynamics.

    Authors: We thank the referee for this important clarification request. Section 3 derives the SSD framework by expressing the selective SSM recurrence as a structured semiseparable matrix and showing its duality to an attention-like form. The construction incorporates input-dependent A, B, and C directly into the diagonal blocks and low-rank factors of the semiseparable decomposition, preserving the exact recurrence without additional low-rank or block-diagonal approximations. To address the concern rigorously, we will add a formal proof in the appendix of the revised manuscript that verifies the equivalence holds exactly for arbitrary input-dependent parameters, with explicit steps showing that no hidden assumptions are introduced that would alter long-range dynamics. revision: yes

  2. Referee: [§5] Mamba-2 architecture and experiments: To substantiate that modeling capacity is preserved, include direct comparisons of Mamba-2 against the original selective SSM on tasks emphasizing long-context input-dependent memory, with ablations isolating the SSD algorithm from implementation optimizations to confirm the reported 2-8X speedups.

    Authors: We agree that targeted experiments would further substantiate the preservation of modeling capacity. The current manuscript shows Mamba-2 remains competitive with Transformers on language modeling while delivering the reported speedups over prior SSMs. In the revision, we will add direct comparisons of Mamba-2 against the original selective SSM (Mamba) on long-context tasks focused on input-dependent memory, such as associative recall and long-range dependency benchmarks. We will also include ablations that isolate the SSD algorithmic improvements from low-level implementation optimizations, thereby confirming that the 2-8X speedups arise from the structured duality rather than engineering alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation rests on independent matrix decompositions

full rationale

The paper establishes connections between SSMs and attention variants by decomposing structured semiseparable matrices, then uses this SSD framework to refine the selective SSM into Mamba-2 with claimed speedups. No step equates a prediction to its fitted input, renames a known result as novel unification, or reduces the central claim to a self-citation chain. Prior Mamba work is cited for context on selectivity, but the duality, decompositions, and algorithm derivations are presented as self-contained linear-algebra results with explicit matrix constructions that do not presuppose the target architecture or performance claims. The framework remains falsifiable via direct implementation and benchmarking against the original recurrence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the framework rests on the mathematical properties of structured semiseparable matrices, which are treated as standard background.

pith-pipeline@v0.9.0 · 5413 in / 952 out tokens · 30947 ms · 2026-05-11T12:10:34.468343+00:00 · methodology


Forward citations

Cited by 50 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Selection, Not Fusion: Radar-Modulated State Space Models for Radar-Camera Depth Estimation

    cs.CV 2026-05 unverdicted novelty 7.0

    Radar-Modulated Selection perturbs only the step size Δ and readout C parameters inside Mamba's selective scan with radar data while keeping other components image-only, yielding state-of-the-art depth estimation on n...

  2. TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles

    cs.CV 2026-05 unverdicted novelty 7.0

    TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.

  3. TIDES: Implicit Time-Awareness in Selective State Space Models

    cs.LG 2026-05 unverdicted novelty 7.0

    TIDES reconciles selective SSM expressivity with continuous-time physical discretization by moving input dependence onto the state matrix, enabling native irregular time series handling and achieving SOTA on UEA and P...

  4. FRACTAL: SSM with Fractional Recurrent Architecture for Computational Temporal Analysis of Long Sequences

    cs.AI 2026-05 unverdicted novelty 7.0

    FRACTAL integrates fractional recurrent architecture into SSMs using a tunable singularity index to capture multi-scale temporal features, reporting 87.11% average on Long Range Arena and outperforming S5.

  5. Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control

    cs.LG 2026-05 unverdicted novelty 7.0

    Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.

  6. PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

    cs.LG 2026-05 unverdicted novelty 7.0

    PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...

  7. Rethink MAE with Linear Time-Invariant Dynamics

    cs.CV 2026-04 unverdicted novelty 7.0

    Token order in frozen visual representations is exploitable via SSM-based LTI probes, revealing pre-training-dependent heterogeneity that fixed pooling misses.

  8. Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

    cs.LG 2026-04 unverdicted novelty 7.0

    Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.

  9. The UNDO Flip-Flop: A Controlled Probe for Reversible Semantic State Management in State Space Model

    cs.LG 2026-04 unverdicted novelty 7.0

    Mamba-2 models fail to learn reversible state retrieval in the UNDO Flip-Flop task, defaulting to a toggle heuristic and achieving only 41% accuracy under adversarial conditions.

  10. S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models

    cs.CL 2026-04 conditional novelty 7.0

    S0 tuning optimizes initial recurrent states in hybrid models to outperform LoRA with zero inference cost on HumanEval and partial cross-domain transfer.

  11. The Randomness Floor: Measuring Intrinsic Non-Randomness in Language Model Token Distributions

    cs.CL 2026-03 unverdicted novelty 7.0

    Language models have an intrinsic randomness floor: transformers show ~0.30 entropic deviation from uniform on neutral prompts, accounting for 88-93% of observed non-randomness, while state-space models exhibit twice ...

  12. Freeze Deep, Train Shallow: Interpretable Layer Allocation for Continued Pre-Training

    cs.CL 2026-05 unverdicted novelty 6.0

    Freezing deep layers and training shallow layers during continued pre-training of LLMs outperforms full fine-tuning and the opposite allocation on C-Eval and CMMLU, guided by a new layer-sensitivity diagnostic.

  13. MambaNetBurst: Direct Byte-level Network Traffic Classification without Tokenization or Pretraining

    cs.CR 2026-05 unverdicted novelty 6.0

    A compact Mamba-2 model performs end-to-end byte-level network traffic classification without tokenization or pre-training and remains competitive with substantially larger pre-trained systems.

  14. RT-Transformer: The Transformer Block as a Spherical State Estimator

    cs.LG 2026-05 unverdicted novelty 6.0

    Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.

  15. Structured Recurrent Mixers for Massively Parallelized Sequence Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.

  16. Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators

    cs.LG 2026-05 unverdicted novelty 6.0

    Spectral Koopman operators let SSMs achieve 100% accuracy on long-gap multi-query associative recall with fixed memory, where pure Mamba fails.

  17. Cubit: Token Mixer with Kernel Ridge Regression

    cs.LG 2026-05 unverdicted novelty 6.0

    Cubit replaces Transformer attention with Kernel Ridge Regression token mixing and shows potential gains on longer sequences.

  18. The Impossibility Triangle of Long-Context Modeling

    cs.CL 2026-05 unverdicted novelty 6.0

    No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.

  19. Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

    cs.CL 2026-04 unverdicted novelty 6.0

    HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.

  20. NAKUL-Med: Spectral-Graph State Space Models with Dynamics Kernels for Medical Signals

    eess.SP 2026-04 unverdicted novelty 6.0

    NAKUL achieves 91.7% accuracy on motor imagery EEG with 28% fewer parameters than EEG-Conformer by using dynamic kernel generation, spectral context modeling, and graph-guided spatial attention.

  21. HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models

    cs.LG 2026-04 unverdicted novelty 6.0

    HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.

  22. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  23. M$^{2}$GRPO: Mamba-based Multi-Agent Group Relative Policy Optimization for Biomimetic Underwater Robots Pursuit

    cs.RO 2026-04 unverdicted novelty 6.0

    M²GRPO uses a Mamba-based policy and normalized group-relative advantages under CTDE to achieve higher pursuit success and capture efficiency than MAPPO and recurrent baselines in simulations and pool tests.

  24. Sonata: A Hybrid World Model for Inertial Kinematics under Clinical Data Scarcity

    cs.LG 2026-04 unverdicted novelty 6.0

    Sonata is a small hybrid world model pre-trained to predict future IMU states that outperforms autoregressive baselines on clinical discrimination, fall-risk prediction, and cross-cohort transfer while fitting on-devi...

  25. MambaBack: Bridging Local Features and Global Contexts in Whole Slide Image Analysis

    cs.CV 2026-04 conditional novelty 6.0

    MambaBack is a hybrid Mamba-CNN model with Hilbert sampling and chunked inference that reports better performance than seven prior methods on five whole-slide image datasets.

  26. Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective

    cs.CV 2026-04 unverdicted novelty 6.0

    The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...

  27. Parcae: Scaling Laws For Stable Looped Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...

  28. Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.

  29. Optimal Decay Spectra for Linear Recurrences

    cs.LG 2026-04 unverdicted novelty 6.0

    PoST reparameterizes decay spectra in linear recurrences with geometric log-spacing and position-adaptive scaling to achieve O(exp(-cN/log t)) decay, improving zero-shot language modeling and long-context retrieval ac...

  30. In-Place Test-Time Training

    cs.LG 2026-04 conditional novelty 6.0

    In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

  31. Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space

    cs.CL 2026-04 unverdicted novelty 6.0

    PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.

  32. Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

    cs.LG 2026-04 unverdicted novelty 6.0

    Stochastic training with random cross-layer KV attention enables depth-wise cache sharing in transformers, cutting memory footprint while preserving or improving performance.

  33. Attention to Mamba: A Recipe for Cross-Architecture Distillation

    cs.CL 2026-04 unverdicted novelty 6.0

    A two-stage distillation recipe converts a Pythia-1B Transformer into a Mamba model that preserves performance with perplexity 14.11 versus the teacher's 13.86.

  34. Computer Architecture's AlphaZero Moment: Automated Discovery in an Encircled World

    cs.AR 2026-03 conditional novelty 6.0

    Automated architectural discovery engines can outperform human design teams by exploring massive design spaces and compressing development cycles from months to weeks.

  35. Kimi Linear: An Expressive, Efficient Attention Architecture

    cs.CL 2025-10 unverdicted novelty 6.0

    Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

  36. MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    cs.CL 2025-06 unverdicted novelty 6.0

    MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...

  37. Titans: Learning to Memorize at Test Time

    cs.LG 2024-12 unverdicted novelty 6.0

    Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.

  38. Mela: Test-Time Memory Consolidation based on Transformation Hypothesis

    cs.CL 2026-05 unverdicted novelty 5.0

    Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.

  39. Kaczmarz Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...

  40. Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving

    cs.DC 2026-05 unverdicted novelty 5.0

    Irminsul recovers up to 83% of prompt tokens above exact-prefix matching and delivers 63% prefill energy savings per cache hit on MLA-MoE models by content-hashing CDC chunks and applying closed-form kr correction.

  41. Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models

    cs.LG 2026-04 unverdicted novelty 5.0

    Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.

  42. Reasoning Primitives in Hybrid and Non-Hybrid LLMs

    cs.CL 2026-04 unverdicted novelty 5.0

    Reasoning augmentation extends the difficulty range for both architectures, but hybrid models stay robust longer than transformers as sequential dependence increases in state-based recall tasks.

  43. LayerTracer: A Joint Task-Particle and Vulnerable-Layer Analysis framework for Arbitrary Large Language Model Architectures

    cs.CL 2026-04 unverdicted novelty 5.0

    LayerTracer defines task particles as the first layer where target token probability rises sharply and vulnerable layers via maximum JS divergence after masking, showing task particles in deep layers and greater robus...

  44. FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control

    cs.LG 2026-04 unverdicted novelty 5.0

    FG²-GDN replaces the scalar beta in the delta update with a channel-wise vector and decouples key/value scaling to improve recall over prior GDN and KDA models.

  45. Sessa: Selective State Space Attention

    cs.LG 2026-04 unverdicted novelty 5.0

    Sessa integrates attention within recurrent paths to achieve power-law memory tails and flexible non-decaying selective retrieval, outperforming baselines on long-context tasks.

  46. Hypergraph-State Collaborative Reasoning for Multi-Object Tracking

    cs.CV 2026-04 unverdicted novelty 5.0

    HyperSSM integrates hypergraphs and state space models to let correlated objects mutually refine motion estimates, stabilizing trajectories under noise and occlusion for state-of-the-art multi-object tracking.

  47. COREY: Entropy-Guided Runtime Chunk Scheduling for Selective Scan Kernels

    cs.CV 2026-04 unverdicted novelty 5.0

    COREY maps activation entropy to chunk sizes for SSM kernels, matching static-oracle latency at kernel level with 3.9-4.4x speedups over baselines but adding overhead that prevents end-to-end gains while preserving ex...

  48. CARE-ECG: Causal Agent-based Reasoning for Explainable and Counterfactual ECG Interpretation

    cs.LG 2026-04 unverdicted novelty 5.0

    CARE-ECG unifies ECG representation learning, causal graph-based diagnosis, and counterfactual assessment in an agentic LLM pipeline to improve accuracy and explanation faithfulness.

  49. Efficient Spatial-Temporal Focal Adapter with SSM for Temporal Action Detection

    cs.CV 2026-04 unverdicted novelty 5.0

    A new adapter module combining boundary-aware state space modeling with spatial processing boosts localization and robustness in temporal action detection.

  50. The Hyperscale Lottery: How State-Space Models Have Sacrificed Edge Efficiency

    cs.AR 2026-04 unverdicted novelty 4.0

    Mamba-3 architectural changes optimized for hyperscale GPUs cause 28% higher edge latency at 880M parameters and 48% at 15M parameters compared to earlier versions.
