pith. machine review for the scientific record.

arxiv: 2205.14135 · v2 · submitted 2022-05-27 · 💻 cs.LG

Recognition: 3 Lean theorem links

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 16:16 UTC · model grok-4.3

classification: 💻 cs.LG
keywords: attention · transformers · self-attention · GPU memory · IO complexity · tiling · long sequences · exact attention

The pith

FlashAttention reduces GPU memory traffic for exact attention computation using tiling, achieving faster Transformer training on long sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Transformers struggle with long sequences because self-attention takes quadratic time and memory. The paper shows that the real bottleneck is often not compute but moving data between the GPU's slow high-bandwidth memory and fast on-chip SRAM. FlashAttention addresses this by tiling the attention computation so that intermediate results stay in fast memory as much as possible. This keeps the attention exact, unlike approximation methods, yet cuts down on slow memory accesses. The result is measurable speedups in training and the ability to use much longer sequences in models.
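
A rough sense of scale (an editorial illustration, not a figure from the paper): the full N x N attention score matrix quickly outgrows on-chip SRAM, which is why standard attention must stream it through HBM. The Python sketch below assumes fp16 scores and a roughly 100 KB per-SM SRAM budget; both numbers are assumptions for illustration only.

    # Illustrative only: size of the full N x N attention score matrix in fp16,
    # compared with an assumed ~100 KB on-chip SRAM budget (not a figure from the paper).
    BYTES_PER_ELEMENT = 2          # fp16 scores
    SRAM_BYTES = 100 * 1024        # assumed on-chip budget per streaming multiprocessor

    for n in (512, 1024, 4096, 16_384, 65_536):
        score_matrix_bytes = n * n * BYTES_PER_ELEMENT
        ratio = score_matrix_bytes / SRAM_BYTES
        print(f"N={n:>6}: scores ~{score_matrix_bytes / 2**20:9.1f} MiB "
              f"(~{ratio:,.0f}x the assumed SRAM budget)")

At N = 16K the score matrix alone is around 512 MiB per head, so it cannot stay resident on-chip; the tiled formulation works on small blocks instead.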

Core claim

FlashAttention is an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high-bandwidth memory (HBM) and on-chip SRAM. It requires fewer HBM accesses than standard attention and is optimal for a range of SRAM sizes. This yields a 15% end-to-end wall-clock speedup on BERT-large, 3x on GPT-2, and 2.4x on Long Range Arena, and enables the first Transformers to reach better-than-chance accuracy on the Path-X and Path-256 tasks, with sequence lengths of 16K and 64K.
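
For reference, the IO-complexity comparison behind this claim, restated from the paper's analysis under the two-level HBM/SRAM model (N sequence length, d head dimension, M SRAM size in elements), can be written as:

    % HBM accesses, as analyzed in the paper (restated; constants omitted)
    \Theta\!\left(Nd + N^{2}\right) \ \text{(standard attention)}
    \qquad \text{vs.} \qquad
    \Theta\!\left(\tfrac{N^{2} d^{2}}{M}\right) \ \text{(FlashAttention)},
    \qquad d \le M \le Nd .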

What carries the argument

Tiling that divides attention into blocks sized for SRAM and fuses softmax with matrix multiplies to avoid repeated HBM writes.
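
A minimal NumPy sketch of that mechanism (an editorial illustration, not the paper's CUDA kernel): keys and values are processed one block at a time, and a running maximum and normalizer let the softmax be rescaled incrementally, so the N x N score matrix is never materialized. The block size, shapes, and key-only tiling loop are simplifying assumptions; the real kernel also tiles over queries and fuses everything into a single GPU kernel.

    import numpy as np

    def tiled_attention(q, k, v, block=128):
        # Exact attention computed block-by-block with an online softmax.
        # Sketch of the tiling idea only: in FlashAttention the blocks live in
        # on-chip SRAM and the softmax is fused with the matrix multiplies.
        n, d = q.shape
        scale = 1.0 / np.sqrt(d)
        out = np.zeros_like(q)
        row_max = np.full(n, -np.inf)      # running max of scores per query
        row_sum = np.zeros(n)              # running softmax normalizer per query

        for start in range(0, n, block):
            kb = k[start:start + block]    # one key block
            vb = v[start:start + block]    # one value block
            scores = (q @ kb.T) * scale    # (n, block) partial scores

            new_max = np.maximum(row_max, scores.max(axis=1))
            correction = np.exp(row_max - new_max)     # rescale previously accumulated state
            p = np.exp(scores - new_max[:, None])      # unnormalized block probabilities

            row_sum = row_sum * correction + p.sum(axis=1)
            out = out * correction[:, None] + p @ vb
            row_max = new_max

        return out / row_sum[:, None]

    # Agreement check against naive attention on random inputs.
    rng = np.random.default_rng(0)
    q = rng.standard_normal((256, 64))
    k = rng.standard_normal((256, 64))
    v = rng.standard_normal((256, 64))
    s = (q @ k.T) / np.sqrt(64)
    w = np.exp(s - s.max(axis=1, keepdims=True))
    naive = (w / w.sum(axis=1, keepdims=True)) @ v
    assert np.allclose(tiled_attention(q, k, v), naive)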

If this is right

  • Transformers train 15% to 3x faster end-to-end on sequences from 512 to 4K.
  • Models achieve better perplexity and classification accuracy from longer contexts.
  • Block-sparse FlashAttention runs faster than prior approximate attention methods while remaining close to exact.
  • Transformers reach above-chance performance on 16K and 64K sequence tasks for the first time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar tiling ideas could speed up other memory-heavy layers such as convolutions or recurrent steps.
  • Hardware with larger on-chip SRAM would amplify the benefits of this style of algorithm.
  • The approach suggests attention can stay exact even as sequence lengths grow into tens of thousands.

Load-bearing premise

The tiling and kernel implementation incurs negligible overhead from synchronization or register pressure and generalizes across GPU architectures without requiring architecture-specific tuning beyond the stated SRAM sizes.

What would settle it

Compare HBM access counts or wall-clock runtime of FlashAttention versus standard attention on a long sequence; if the number of accesses or runtime shows no reduction, the IO savings claim would be false.
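
One way to run the wall-clock half of that check (an editorial sketch, not a protocol from the paper): time a naive attention that materializes the full score matrix against PyTorch's fused scaled_dot_product_attention, which dispatches to a FlashAttention-style kernel on supported GPUs. The shapes, dtype, and availability of the fused backend are assumptions; measuring HBM access counts directly would additionally require hardware profiling, which this sketch does not do.

    import time
    import torch
    import torch.nn.functional as F

    def naive_attention(q, k, v):
        # Materializes the full (N x N) score matrix.
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        return torch.softmax(scores, dim=-1) @ v

    def timed(fn, *args, iters=20):
        # Simple CUDA wall-clock timing with one warm-up call; assumes a GPU is available.
        fn(*args)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            fn(*args)
        torch.cuda.synchronize()
        return (time.perf_counter() - start) / iters

    device, dtype = "cuda", torch.float16
    batch, heads, seqlen, headdim = 4, 16, 4096, 64    # assumed shapes
    q, k, v = (torch.randn(batch, heads, seqlen, headdim, device=device, dtype=dtype)
               for _ in range(3))

    t_naive = timed(naive_attention, q, k, v)
    t_fused = timed(F.scaled_dot_product_attention, q, k, v)
    print(f"naive: {t_naive * 1e3:.2f} ms  fused: {t_fused * 1e3:.2f} ms  "
          f"speedup ~{t_naive / t_fused:.1f}x")

If the fused path showed no runtime reduction at long sequence lengths, the IO-savings claim would be in doubt; the review above notes that the paper's own evidence is wall-clock gains that grow with sequence length.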

read the original abstract

Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware -- accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method. FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3× speedup on GPT-2 (seq. length 1K), and 2.4× speedup on long-range arena (seq. length 1K-4K). FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes FlashAttention, an IO-aware exact attention algorithm that employs tiling to minimize reads and writes between GPU high-bandwidth memory (HBM) and on-chip SRAM. It derives an IO complexity analysis showing fewer HBM accesses than standard attention and optimality for a range of SRAM sizes under a two-level memory model. The approach is extended to block-sparse attention, and experiments demonstrate wall-clock speedups (e.g., 3x on GPT-2 with seq. length 1K) along with improved long-context performance on tasks such as Path-X (seq. length 16K).

Significance. If the central claims hold, the work is significant for providing a hardware-aware optimization that achieves practical speedups and longer contexts without trading off exactness or model quality, unlike prior approximate attention methods. The first-principles IO complexity analysis (derived from standard memory-access counting) and the empirical results enabling new capabilities (e.g., better-than-chance Path-256 performance) represent a concrete advance in efficient Transformer training and inference.

major comments (1)
  1. [§3] IO complexity analysis: The optimality claim for a range of SRAM sizes and the assertion of fewer HBM accesses rest on an idealized two-level memory model; the manuscript does not report direct profiling of actual HBM traffic or an overhead breakdown (synchronization, register pressure, bank conflicts) in the CUDA kernel to confirm that the predicted reduction is realized as tile counts grow with sequence length.
minor comments (2)
  1. [§4] Experiments: Speedup results are reported exclusively on A100 GPUs; adding results or analysis for other architectures (e.g., V100 or H100) with the same tiling parameters would strengthen the generalization claim.
  2. [Abstract, §5] The block-sparse extension is presented as faster than existing approximate methods, but the manuscript could state the sparsity-pattern selection criteria and any quality trade-offs more explicitly in the main text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of our work, the recognition of its significance, and the recommendation for minor revision. We address the major comment point by point below.

read point-by-point responses
  1. Referee: [§3] IO complexity analysis: The optimality claim for a range of SRAM sizes and the assertion of fewer HBM accesses rest on an idealized two-level memory model; the manuscript does not report direct profiling of actual HBM traffic or an overhead breakdown (synchronization, register pressure, bank conflicts) in the CUDA kernel to confirm that the predicted reduction is realized as tile counts grow with sequence length.

    Authors: We thank the referee for this careful observation. The IO complexity analysis in Section 3 is derived under the standard two-level memory hierarchy model (HBM vs. on-chip SRAM), which is the conventional abstraction used in the analysis of cache-efficient and IO-aware algorithms. Under this model we prove both the asymptotic reduction in HBM accesses relative to standard attention and optimality for a range of SRAM sizes. We acknowledge that the manuscript does not include direct hardware-counter measurements of HBM traffic or a fine-grained breakdown of CUDA-level overheads (synchronization, register pressure, bank conflicts). Such profiling is technically challenging to obtain without substantial instrumentation overhead and is not reported in the current version. Instead, we validate the practical impact of the IO savings through wall-clock timings that improve with increasing sequence length (Figures 3–5 and Tables 1–2), an outcome that would be impossible if the predicted HBM reduction were not realized. The open-source kernel is publicly available for independent profiling. In the revised manuscript we will add a short paragraph in Section 3 clarifying the modeling assumptions and the empirical validation strategy. revision: yes

Circularity Check

0 steps flagged

No significant circularity: IO complexity follows from standard memory-access counting on the tiled algorithm.

full rationale

The paper's core derivation (IO complexity of FlashAttention via tiling) is obtained by direct counting of HBM reads/writes under a standard two-level memory model, without any fitted parameters, self-referential definitions, or load-bearing self-citations. The optimality claim for a range of SRAM sizes is a direct consequence of the access formulas applied to the algorithm's structure. Empirical speedups and block-sparse extension are separate experimental results, not part of the derivation chain. No step reduces to its own inputs by construction.
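
To make that counting concrete, here is a toy instantiation of the two access-count formulas (the sequence length, head dimension, and SRAM size below are illustrative assumptions, and constants are omitted, so only the order of magnitude is meaningful):

    # Asymptotic HBM access counts under the two-level model, constants omitted.
    def standard_attention_accesses(n, d):
        return n * d + n * n               # Theta(N d + N^2)

    def flash_attention_accesses(n, d, m):
        return (n * n * d * d) / m         # Theta(N^2 d^2 / M)

    n, d = 4096, 64                        # assumed sequence length and head dimension
    m = 100_000                            # assumed SRAM size, in elements
    std = standard_attention_accesses(n, d)
    fa = flash_attention_accesses(n, d, m)
    print(f"standard ~{std:.2e}  flash ~{fa:.2e}  ratio ~{std / fa:.0f}x fewer HBM accesses")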

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions about GPU memory hierarchy sizes and access costs plus the correctness of the tiling implementation; no free parameters are fitted to data and no new entities are postulated.

axioms (1)
  • domain assumption: GPU memory hierarchy consists of HBM (slow, large) and SRAM (fast, small) with known relative access costs.
    Invoked in the IO complexity analysis section referenced in the abstract.

pith-pipeline@v0.9.0 · 5641 in / 1124 out tokens · 58351 ms · 2026-05-12T16:16:31.121428+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation Jcost uniqueness · tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    Linked passage: "We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM."

  • DimensionForcing D=3 forcing · tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    Linked passage: "We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes."

  • HierarchyEmergence hierarchy_emergence_forces_phi · tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    Linked passage: "FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large"

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Efficient Training on Multiple Consumer GPUs with RoundPipe

    cs.DC 2026-04 conditional novelty 8.0

    RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on...

  2. Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation

    cs.DC 2026-04 unverdicted novelty 8.0

    Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.

  3. CUDAHercules: Benchmarking Hardware-Aware Expert-level CUDA Optimization for LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    CUDAHercules benchmark demonstrates that leading LLMs generate functional CUDA code but fail to recover expert-level optimization strategies needed for peak performance on Ampere, Hopper, and Blackwell GPUs.

  4. Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation

    cs.DC 2026-05 unverdicted novelty 7.0

    Dooly reduces LLM inference profiling costs by 56.4% via configuration-agnostic taint-based labeling and selective database reuse, delivering simulation accuracy within 5% MAPE for TTFT and 8% for TPOT across 12 models.

  5. Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

    cs.SE 2026-05 conditional novelty 7.0

    Kerncap automatically extracts isolated, reproducible GPU kernels from large HIP and Triton applications on AMD GPUs by capturing HSA dispatches and producing self-contained reproducer projects that preserve virtual-a...

  6. Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

    cs.SE 2026-05 conditional novelty 7.0

    Kerncap automates extraction of faithful, self-contained GPU kernel reproducers from AMD HIP and Triton workloads via HSA interception and address-space closure, delivering 13.6x faster isolated tuning.

  7. Projection-Free Transformers via Gaussian Kernel Attention

    cs.LG 2026-05 unverdicted novelty 7.0

    Gaussian Kernel Attention replaces learned QKV projections with a Gaussian RBF kernel on per-head token features, using 0.42x parameters and 0.49x FLOPs while showing competitive language modeling performance at depth 20.

  8. CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration

    cs.DC 2026-04 unverdicted novelty 7.0

    CacheFlow cuts TTFT by 10-62% in batched LLM serving via 3D-parallel KV cache restoration and a two-pointer scheduler that overlaps recompute and I/O.

  9. Transactional Attention: Semantic Sponsorship for KV-Cache Retention

    cs.CL 2026-04 unverdicted novelty 7.0

    Transactional Attention uses semantic sponsorship from anchor patterns to retain dormant critical tokens in KV caches, achieving 100% credential retrieval at 16 tokens where all prior methods fail.

  10. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    cs.LG 2025-02 unverdicted novelty 7.0

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  11. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    cs.LG 2022-10 unverdicted novelty 7.0

    GPTQ quantizes 175B-parameter GPT models to 3-4 bits per weight in one shot using approximate second-order information, achieving negligible accuracy degradation and 3-4x inference speedups.

  12. Remember to Forget: Gated Adaptive Positional Encoding

    cs.LG 2026-05 unverdicted novelty 6.0

    GAPE augments RoPE with query- and key-dependent gates to stabilize attention and improve long-context performance in language models.

  13. KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving

    cs.AR 2026-05 unverdicted novelty 6.0

    KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.

  14. Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism

    cs.DC 2026-05 unverdicted novelty 6.0

    Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.

  15. Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study

    cs.SE 2026-04 unverdicted novelty 6.0

    Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.

  16. HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models

    cs.LG 2026-04 unverdicted novelty 6.0

    HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.

  17. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  18. Efficient Streaming Language Models with Attention Sinks

    cs.CL 2023-09 accept novelty 6.0

    StreamingLLM lets finite-window LLMs generalize to infinite-length sequences by retaining initial-token KV states as attention sinks, enabling stable streaming inference up to 4M tokens.

  19. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    cs.CL 2023-05 unverdicted novelty 6.0

    Uptraining multi-head transformer checkpoints to grouped-query attention models achieves near multi-head quality at multi-query inference speeds using 5% additional compute.

  20. Token Merging: Your ViT But Faster

    cs.CV 2022-10 unverdicted novelty 6.0

    Token Merging (ToMe) doubles the throughput of large Vision Transformers on images, video, and audio by merging similar tokens with a fast matching algorithm, incurring only 0.2-0.4% accuracy loss.

  21. Position: LLM Inference Should Be Evaluated as Energy-to-Token Production

    cs.CE 2026-05 unverdicted novelty 5.0

    LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.

  22. Kaczmarz Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...

  23. A Hybrid Method for Low-Resource Named Entity Recognition

    cs.CE 2026-05 unverdicted novelty 5.0

    The hybrid method with LLM-augmented data achieves F1 improvements of 7-24 points over baselines on five Vietnamese domain datasets.

  24. StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

    cs.LG 2026-05 accept novelty 5.0

    Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

  25. UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training

    cs.DC 2026-04 unverdicted novelty 5.0

    UniEP fuses MoE communication and computation into unified MegaKernels with deterministic token ordering, delivering 1.03x-1.38x speedups over prior work while preserving training accuracy.

  26. Reinforcement Learning Improves LLM Accuracy and Reasoning in Disease Classification from Radiology Reports

    cs.AI 2026-04 unverdicted novelty 5.0

    SFT followed by GRPO improves LLM accuracy and reasoning recall in disease classification from radiology reports on three radiologist-annotated datasets.

  27. HieraSparse: Hierarchical Semi-Structured Sparse KV Attention

    cs.DC 2026-04 unverdicted novelty 5.0

    HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...

  28. Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM

    cs.DC 2026-04 unverdicted novelty 4.0

    A framework combines multi-LoRA runtime switching, multi-stream stylistic decoding, and Dynamic Self-Speculative Decoding with INT4 quantization to achieve 4-6x memory and latency gains for on-device inference of a on...

Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · cited by 27 Pith papers · 6 internal anchors
