pith. machine review for the scientific record.

arxiv: 2205.14135 · v2 · submitted 2022-05-27 · 💻 cs.LG

Recognition: 3 Lean theorem links

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 16:16 UTC · model grok-4.3

classification: 💻 cs.LG
keywords: attention · transformers · self-attention · GPU memory · IO complexity · tiling · long sequences · exact attention

The pith

FlashAttention reduces GPU memory traffic for exact attention computation using tiling, achieving faster Transformer training on long sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Transformers struggle with long sequences because self-attention takes quadratic time and memory. The paper shows that the real bottleneck is often not compute but moving data between the GPU's slow high-bandwidth memory and fast on-chip SRAM. FlashAttention addresses this by tiling the attention computation so that intermediate results stay in fast memory as much as possible. This keeps the attention exact, unlike approximation methods, yet cuts down on slow memory accesses. The result is measurable speedups in training and the ability to use much longer sequences in models.
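
A rough sense of scale (an editorial illustration, not a figure from the paper): the full N x N attention score matrix quickly outgrows on-chip SRAM, which is why standard attention must stream it through HBM. The Python sketch below assumes fp16 scores and a roughly 100 KB per-SM SRAM budget; both numbers are assumptions for illustration only.

    # Illustrative only: size of the full N x N attention score matrix in fp16,
    # compared with an assumed ~100 KB on-chip SRAM budget (not a figure from the paper).
    BYTES_PER_ELEMENT = 2          # fp16 scores
    SRAM_BYTES = 100 * 1024        # assumed on-chip budget per streaming multiprocessor

    for n in (512, 1024, 4096, 16_384, 65_536):
        score_matrix_bytes = n * n * BYTES_PER_ELEMENT
        ratio = score_matrix_bytes / SRAM_BYTES
        print(f"N={n:>6}: scores ~{score_matrix_bytes / 2**20:9.1f} MiB "
              f"(~{ratio:,.0f}x the assumed SRAM budget)")

At N = 16K the score matrix alone is around 512 MiB per head, so it cannot stay resident on-chip; the tiled formulation works on small blocks instead.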

Core claim

FlashAttention is an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high-bandwidth memory (HBM) and on-chip SRAM. It requires fewer HBM accesses than standard attention and is optimal for a range of SRAM sizes. This yields a 15% end-to-end wall-clock speedup on BERT-large, 3x on GPT-2, and 2.4x on Long Range Arena, and enables the first Transformers to reach better-than-chance accuracy on the Path-X and Path-256 tasks, with sequence lengths of 16K and 64K.
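
For reference, the IO-complexity comparison behind this claim, restated from the paper's analysis under the two-level HBM/SRAM model (N sequence length, d head dimension, M SRAM size in elements), can be written as:

    % HBM accesses, as analyzed in the paper (restated; constants omitted)
    \Theta\!\left(Nd + N^{2}\right) \ \text{(standard attention)}
    \qquad \text{vs.} \qquad
    \Theta\!\left(\tfrac{N^{2} d^{2}}{M}\right) \ \text{(FlashAttention)},
    \qquad d \le M \le Nd .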

What carries the argument

Tiling that divides attention into blocks sized for SRAM and fuses softmax with matrix multiplies to avoid repeated HBM writes.
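
A minimal NumPy sketch of that mechanism (an editorial illustration, not the paper's CUDA kernel): keys and values are processed one block at a time, and a running maximum and normalizer let the softmax be rescaled incrementally, so the N x N score matrix is never materialized. The block size, shapes, and key-only tiling loop are simplifying assumptions; the real kernel also tiles over queries and fuses everything into a single GPU kernel.

    import numpy as np

    def tiled_attention(q, k, v, block=128):
        # Exact attention computed block-by-block with an online softmax.
        # Sketch of the tiling idea only: in FlashAttention the blocks live in
        # on-chip SRAM and the softmax is fused with the matrix multiplies.
        n, d = q.shape
        scale = 1.0 / np.sqrt(d)
        out = np.zeros_like(q)
        row_max = np.full(n, -np.inf)      # running max of scores per query
        row_sum = np.zeros(n)              # running softmax normalizer per query

        for start in range(0, n, block):
            kb = k[start:start + block]    # one key block
            vb = v[start:start + block]    # one value block
            scores = (q @ kb.T) * scale    # (n, block) partial scores

            new_max = np.maximum(row_max, scores.max(axis=1))
            correction = np.exp(row_max - new_max)     # rescale previously accumulated state
            p = np.exp(scores - new_max[:, None])      # unnormalized block probabilities

            row_sum = row_sum * correction + p.sum(axis=1)
            out = out * correction[:, None] + p @ vb
            row_max = new_max

        return out / row_sum[:, None]

    # Agreement check against naive attention on random inputs.
    rng = np.random.default_rng(0)
    q = rng.standard_normal((256, 64))
    k = rng.standard_normal((256, 64))
    v = rng.standard_normal((256, 64))
    s = (q @ k.T) / np.sqrt(64)
    w = np.exp(s - s.max(axis=1, keepdims=True))
    naive = (w / w.sum(axis=1, keepdims=True)) @ v
    assert np.allclose(tiled_attention(q, k, v), naive)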

If this is right

  • Transformers train 15% to 3x faster end-to-end on sequences from 512 to 4K.
  • Models achieve better perplexity and classification accuracy from longer contexts.
  • Block-sparse FlashAttention runs faster than prior approximate attention methods while remaining close to exact.
  • Transformers reach above-chance performance on 16K and 64K sequence tasks for the first time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar tiling ideas could speed up other memory-heavy layers such as convolutions or recurrent steps.
  • Hardware with larger on-chip SRAM would amplify the benefits of this style of algorithm.
  • The approach suggests attention can stay exact even as sequence lengths grow into tens of thousands.

Load-bearing premise

The tiling and kernel implementation incurs negligible overhead from synchronization or register pressure and generalizes across GPU architectures without requiring architecture-specific tuning beyond the stated SRAM sizes.

What would settle it

Compare HBM access counts or wall-clock runtime of FlashAttention versus standard attention on a long sequence; if the number of accesses or runtime shows no reduction, the IO savings claim would be false.
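
One way to run the wall-clock half of that check (an editorial sketch, not a protocol from the paper): time a naive attention that materializes the full score matrix against PyTorch's fused scaled_dot_product_attention, which dispatches to a FlashAttention-style kernel on supported GPUs. The shapes, dtype, and availability of the fused backend are assumptions; measuring HBM access counts directly would additionally require hardware profiling, which this sketch does not do.

    import time
    import torch
    import torch.nn.functional as F

    def naive_attention(q, k, v):
        # Materializes the full (N x N) score matrix.
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        return torch.softmax(scores, dim=-1) @ v

    def timed(fn, *args, iters=20):
        # Simple CUDA wall-clock timing with one warm-up call; assumes a GPU is available.
        fn(*args)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            fn(*args)
        torch.cuda.synchronize()
        return (time.perf_counter() - start) / iters

    device, dtype = "cuda", torch.float16
    batch, heads, seqlen, headdim = 4, 16, 4096, 64    # assumed shapes
    q, k, v = (torch.randn(batch, heads, seqlen, headdim, device=device, dtype=dtype)
               for _ in range(3))

    t_naive = timed(naive_attention, q, k, v)
    t_fused = timed(F.scaled_dot_product_attention, q, k, v)
    print(f"naive: {t_naive * 1e3:.2f} ms  fused: {t_fused * 1e3:.2f} ms  "
          f"speedup ~{t_naive / t_fused:.1f}x")

If the fused path showed no runtime reduction at long sequence lengths, the IO-savings claim would be in doubt; the review above notes that the paper's own evidence is wall-clock gains that grow with sequence length.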

read the original abstract

Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware -- accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method. FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3× speedup on GPT-2 (seq. length 1K), and 2.4× speedup on long-range arena (seq. length 1K-4K). FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes FlashAttention, an IO-aware exact attention algorithm that employs tiling to minimize reads and writes between GPU high-bandwidth memory (HBM) and on-chip SRAM. It derives an IO complexity analysis showing fewer HBM accesses than standard attention and optimality for a range of SRAM sizes under a two-level memory model. The approach is extended to block-sparse attention, and experiments demonstrate wall-clock speedups (e.g., 3x on GPT-2 with seq. length 1K) along with improved long-context performance on tasks such as Path-X (seq. length 16K).

Significance. If the central claims hold, the work is significant for providing a hardware-aware optimization that achieves practical speedups and longer contexts without trading off exactness or model quality, unlike prior approximate attention methods. The first-principles IO complexity analysis (derived from standard memory-access counting) and the empirical results enabling new capabilities (e.g., better-than-chance Path-256 performance) represent a concrete advance in efficient Transformer training and inference.

major comments (1)
  1. [§3] IO complexity analysis: The optimality claim for a range of SRAM sizes and the assertion of fewer HBM accesses rest on an idealized two-level memory model; the manuscript does not report direct profiling of actual HBM traffic or an overhead breakdown (synchronization, register pressure, bank conflicts) in the CUDA kernel to confirm that the predicted reduction is realized as tile counts grow with sequence length.
minor comments (2)
  1. [§4] Experiments: Speedup results are reported exclusively on A100 GPUs; adding results or analysis for other architectures (e.g., V100 or H100) with the same tiling parameters would strengthen the generalization claim.
  2. [Abstract, §5] The block-sparse extension is presented as faster than existing approximate methods, but the manuscript could state the sparsity-pattern selection criteria and any quality trade-offs more explicitly in the main text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of our work, the recognition of its significance, and the recommendation for minor revision. We address the major comment point by point below.

read point-by-point responses
  1. Referee: [§3] IO complexity analysis: The optimality claim for a range of SRAM sizes and the assertion of fewer HBM accesses rest on an idealized two-level memory model; the manuscript does not report direct profiling of actual HBM traffic or an overhead breakdown (synchronization, register pressure, bank conflicts) in the CUDA kernel to confirm that the predicted reduction is realized as tile counts grow with sequence length.

    Authors: We thank the referee for this careful observation. The IO complexity analysis in Section 3 is derived under the standard two-level memory hierarchy model (HBM vs. on-chip SRAM), which is the conventional abstraction used in the analysis of cache-efficient and IO-aware algorithms. Under this model we prove both the asymptotic reduction in HBM accesses relative to standard attention and optimality for a range of SRAM sizes. We acknowledge that the manuscript does not include direct hardware-counter measurements of HBM traffic or a fine-grained breakdown of CUDA-level overheads (synchronization, register pressure, bank conflicts). Such profiling is technically challenging to obtain without substantial instrumentation overhead and is not reported in the current version. Instead, we validate the practical impact of the IO savings through wall-clock timings that improve with increasing sequence length (Figures 3–5 and Tables 1–2), an outcome that would be impossible if the predicted HBM reduction were not realized. The open-source kernel is publicly available for independent profiling. In the revised manuscript we will add a short paragraph in Section 3 clarifying the modeling assumptions and the empirical validation strategy. revision: yes

Circularity Check

0 steps flagged

No significant circularity: IO complexity follows from standard memory-access counting on the tiled algorithm.

full rationale

The paper's core derivation (IO complexity of FlashAttention via tiling) is obtained by direct counting of HBM reads/writes under a standard two-level memory model, without any fitted parameters, self-referential definitions, or load-bearing self-citations. The optimality claim for a range of SRAM sizes is a direct consequence of the access formulas applied to the algorithm's structure. Empirical speedups and block-sparse extension are separate experimental results, not part of the derivation chain. No step reduces to its own inputs by construction.
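
To make that counting concrete, here is a toy instantiation of the two access-count formulas (the sequence length, head dimension, and SRAM size below are illustrative assumptions, and constants are omitted, so only the order of magnitude is meaningful):

    # Asymptotic HBM access counts under the two-level model, constants omitted.
    def standard_attention_accesses(n, d):
        return n * d + n * n               # Theta(N d + N^2)

    def flash_attention_accesses(n, d, m):
        return (n * n * d * d) / m         # Theta(N^2 d^2 / M)

    n, d = 4096, 64                        # assumed sequence length and head dimension
    m = 100_000                            # assumed SRAM size, in elements
    std = standard_attention_accesses(n, d)
    fa = flash_attention_accesses(n, d, m)
    print(f"standard ~{std:.2e}  flash ~{fa:.2e}  ratio ~{std / fa:.0f}x fewer HBM accesses")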

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions about GPU memory hierarchy sizes and access costs plus the correctness of the tiling implementation; no free parameters are fitted to data and no new entities are postulated.

axioms (1)
  • domain assumption: GPU memory hierarchy consists of HBM (slow, large) and SRAM (fast, small) with known relative access costs.
    Invoked in the IO complexity analysis section referenced in the abstract.

pith-pipeline@v0.9.0 · 5641 in / 1124 out tokens · 58351 ms · 2026-05-12T16:16:31.121428+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation Jcost uniqueness · tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    Linked passage: "We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM."

  • DimensionForcing D=3 forcing · tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    Linked passage: "We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes."

  • HierarchyEmergence hierarchy_emergence_forces_phi · tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    Linked passage: "FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large"

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Efficient Training on Multiple Consumer GPUs with RoundPipe

    cs.DC 2026-04 conditional novelty 8.0

    RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on...

  2. Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation

    cs.DC 2026-04 unverdicted novelty 8.0

    Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.

  3. CUDAHercules: Benchmarking Hardware-Aware Expert-level CUDA Optimization for LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    CUDAHercules benchmark demonstrates that leading LLMs generate functional CUDA code but fail to recover expert-level optimization strategies needed for peak performance on Ampere, Hopper, and Blackwell GPUs.

  4. Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation

    cs.DC 2026-05 unverdicted novelty 7.0

    Dooly reduces LLM inference profiling costs by 56.4% via configuration-agnostic taint-based labeling and selective database reuse, delivering simulation accuracy within 5% MAPE for TTFT and 8% for TPOT across 12 models.

  5. Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

    cs.SE 2026-05 conditional novelty 7.0

    Kerncap automatically extracts isolated, reproducible GPU kernels from large HIP and Triton applications on AMD GPUs by capturing HSA dispatches and producing self-contained reproducer projects that preserve virtual-a...

  6. Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

    cs.SE 2026-05 conditional novelty 7.0

    Kerncap automates extraction of faithful, self-contained GPU kernel reproducers from AMD HIP and Triton workloads via HSA interception and address-space closure, delivering 13.6x faster isolated tuning.

  7. Projection-Free Transformers via Gaussian Kernel Attention

    cs.LG 2026-05 unverdicted novelty 7.0

    Gaussian Kernel Attention replaces learned QKV projections with a Gaussian RBF kernel on per-head token features, using 0.42x parameters and 0.49x FLOPs while showing competitive language modeling performance at depth 20.

  8. CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration

    cs.DC 2026-04 unverdicted novelty 7.0

    CacheFlow cuts TTFT by 10-62% in batched LLM serving via 3D-parallel KV cache restoration and a two-pointer scheduler that overlaps recompute and I/O.

  9. Transactional Attention: Semantic Sponsorship for KV-Cache Retention

    cs.CL 2026-04 unverdicted novelty 7.0

    Transactional Attention uses semantic sponsorship from anchor patterns to retain dormant critical tokens in KV caches, achieving 100% credential retrieval at 16 tokens where all prior methods fail.

  10. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    cs.LG 2025-02 unverdicted novelty 7.0

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  11. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    cs.LG 2022-10 unverdicted novelty 7.0

    GPTQ quantizes 175B-parameter GPT models to 3-4 bits per weight in one shot using approximate second-order information, achieving negligible accuracy degradation and 3-4x inference speedups.

  12. Remember to Forget: Gated Adaptive Positional Encoding

    cs.LG 2026-05 unverdicted novelty 6.0

    GAPE augments RoPE with query- and key-dependent gates to stabilize attention and improve long-context performance in language models.

  13. KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving

    cs.AR 2026-05 unverdicted novelty 6.0

    KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.

  14. Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism

    cs.DC 2026-05 unverdicted novelty 6.0

    Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.

  15. Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study

    cs.SE 2026-04 unverdicted novelty 6.0

    Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.

  16. HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models

    cs.LG 2026-04 unverdicted novelty 6.0

    HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.

  17. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  18. Efficient Streaming Language Models with Attention Sinks

    cs.CL 2023-09 accept novelty 6.0

    StreamingLLM lets finite-window LLMs generalize to infinite-length sequences by retaining initial-token KV states as attention sinks, enabling stable streaming inference up to 4M tokens.

  19. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    cs.CL 2023-05 unverdicted novelty 6.0

    Uptraining multi-head transformer checkpoints to grouped-query attention models achieves near multi-head quality at multi-query inference speeds using 5% additional compute.

  20. Token Merging: Your ViT But Faster

    cs.CV 2022-10 unverdicted novelty 6.0

    Token Merging (ToMe) doubles the throughput of large Vision Transformers on images, video, and audio by merging similar tokens with a fast matching algorithm, incurring only 0.2-0.4% accuracy loss.

  21. Position: LLM Inference Should Be Evaluated as Energy-to-Token Production

    cs.CE 2026-05 unverdicted novelty 5.0

    LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.

  22. Kaczmarz Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...

  23. A Hybrid Method for Low-Resource Named Entity Recognition

    cs.CE 2026-05 unverdicted novelty 5.0

    The hybrid method with LLM-augmented data achieves F1 improvements of 7-24 points over baselines on five Vietnamese domain datasets.

  24. StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

    cs.LG 2026-05 accept novelty 5.0

    Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

  25. UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training

    cs.DC 2026-04 unverdicted novelty 5.0

    UniEP fuses MoE communication and computation into unified MegaKernels with deterministic token ordering, delivering 1.03x-1.38x speedups over prior work while preserving training accuracy.

  26. Reinforcement Learning Improves LLM Accuracy and Reasoning in Disease Classification from Radiology Reports

    cs.AI 2026-04 unverdicted novelty 5.0

    SFT followed by GRPO improves LLM accuracy and reasoning recall in disease classification from radiology reports on three radiologist-annotated datasets.

  27. HieraSparse: Hierarchical Semi-Structured Sparse KV Attention

    cs.DC 2026-04 unverdicted novelty 5.0

    HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...

  28. Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM

    cs.DC 2026-04 unverdicted novelty 4.0

    A framework combines multi-LoRA runtime switching, multi-stream stylistic decoding, and Dynamic Self-Speculative Decoding with INT4 quantization to achieve 4-6x memory and latency gains for on-device inference of a on...

Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · cited by 27 Pith papers · 6 internal anchors
