pith. machine review for the scientific record.

arxiv: 2001.04451 · v2 · submitted 2020-01-13 · 💻 cs.LG · cs.CL · stat.ML

Recognition: 2 theorem links

· Lean Theorem

Reformer: The Efficient Transformer

Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya

Pith reviewed 2026-05-13 07:14 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · stat.ML
keywords reformer · transformer · locality-sensitive hashing · reversible residual layers · efficient attention · long sequences · memory efficiency · sequence modeling

The pith

The Reformer matches standard Transformer performance on long sequences while using much less memory and running faster.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large Transformer models become prohibitively expensive to train on long sequences because attention costs grow quadratically with length and activations must be stored once per layer. The paper replaces dot-product attention with a locality-sensitive hashing version whose cost scales as O(L log L) instead of O(L²). It also switches to reversible residual layers so that activations are stored only once during training rather than once for each of the N layers. The resulting Reformer achieves comparable accuracy to a standard Transformer yet requires far less memory and runs faster when sequences are long. A reader would care because this change makes it practical to apply the same powerful sequence modeling approach to inputs that were previously too large to handle.
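The bucketing step behind LSH attention can be sketched in a few lines. This is a minimal illustration of the angular ("random rotation") hashing family the paper builds on, not the authors' implementation; the function name and bucket count here are illustrative.

```python
import numpy as np

def lsh_bucket(vectors, n_buckets, rng):
    """Assign each vector to one of n_buckets via a random rotation.

    hash(x) = argmax([xR ; -xR]), where R has n_buckets // 2 random
    Gaussian columns, so vectors that are close on the sphere tend to
    land in the same bucket.
    """
    d = vectors.shape[-1]
    R = rng.standard_normal((d, n_buckets // 2))
    projected = vectors @ R                      # (L, n_buckets // 2)
    return np.argmax(np.concatenate([projected, -projected], axis=-1), axis=-1)

rng = np.random.default_rng(0)
q = rng.standard_normal((128, 64))               # toy sequence of 128 queries
buckets = lsh_bucket(q, n_buckets=16, rng=rng)   # one bucket id per position
# Attention is then restricted to positions that share a bucket: sorting
# by bucket id and chunking is what yields the O(L log L) cost above.
```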

Core claim

The Reformer achieves performance on par with Transformer models through two changes: locality-sensitive hashing attention, which cuts complexity from O(L²) to O(L log L), and reversible residual layers, which store activations once instead of N times, where N is the number of layers. The result is a model that is much more memory-efficient and faster on long sequences.

What carries the argument

Locality-sensitive hashing attention combined with reversible residual layers, which approximate exact attention at lower cost and eliminate the need to store separate activations for every layer.

If this is right

  • The model can handle sequences much longer than those feasible with standard Transformers without exceeding available memory.
  • Training time on long sequences decreases because attention computation drops from quadratic to linearithmic.
  • Performance on typical sequence tasks stays comparable to that of exact-attention Transformers.
  • Memory use for activations stays constant regardless of how many layers are stacked.
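Rough back-of-envelope counts make these bullets concrete. The constants below (chunk size, model width) are illustrative assumptions, not measurements from the paper:

```python
import math

def attention_pairs(L):
    """Score entries touched by exact dot-product attention: one per query-key pair."""
    return L * L

def lsh_pairs(L, chunk=64):
    """Crude proxy for LSH attention cost: each position attends within a
    sorted bucket chunk, and sorting contributes the L log L term.
    (chunk=64 is an illustrative constant, not a value from the paper.)"""
    return L * chunk + int(L * math.log2(L))

def activation_floats(L, d_model, n_layers, reversible):
    """Standard residuals store one activation per layer; reversible
    residuals recompute them, so storage is independent of depth."""
    return L * d_model * (1 if reversible else n_layers)

for L in (1_024, 65_536):
    print(L, attention_pairs(L), lsh_pairs(L))
```

At L = 65,536 the quadratic count is already three orders of magnitude larger than the linearithmic proxy, which is the regime where the paper's speedups are claimed to appear.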

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same hashing idea could be applied to other quadratic operations inside Transformers to gain further speedups.
  • Tasks that demand very fine-grained attention might require more hash buckets or rounds to keep quality close to the exact version.
  • The reversible-layer trick could be paired with other memory-saving methods such as checkpointing to push sequence lengths even higher.

Load-bearing premise

The locality-sensitive hashing approximation to attention preserves enough accuracy that downstream task performance remains comparable to exact attention.

What would settle it

A side-by-side experiment on a long-sequence task in which the Reformer achieves noticeably lower accuracy than a standard Transformer of the same size and depth would show that the hashing approximation does not preserve sufficient accuracy.

read the original abstract

Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O($L^2$) to O($L\log L$), where $L$ is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of $N$ times, where $N$ is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Reformer, a Transformer variant that replaces standard dot-product attention with locality-sensitive hashing (LSH) attention to reduce complexity from O(L²) to O(L log L) and uses reversible residual layers to store activations only once during training instead of N times. The central claim is that the resulting model achieves performance on par with standard Transformers while being substantially more memory-efficient and faster on long sequences.

Significance. If the empirical parity holds, the work is significant for enabling Transformer-scale models on longer sequences in domains such as long-document NLP, music, and genomics. The reversible-residual technique is exact and provides a clean memory reduction; the LSH approximation supplies the asymptotic speedup. Together they form a practical, reproducible recipe that has already influenced subsequent efficient-attention designs.

major comments (2)
  1. [Section 3 (LSH Attention)] LSH attention mechanism (Section 3): the headline claim of performance parity rests on the assertion that LSH with the reported bucket counts and hash rounds produces attention outputs sufficiently close to exact dot-product attention. No quantitative bound on approximation error, no analysis of collision probability under sparse or long-range attention patterns, and no ablation varying the number of rounds or buckets are provided; this leaves the weakest assumption untested and load-bearing for the central result.
  2. [Section 4 (Experiments)] Experimental evaluation (Section 4 and Tables 1-3): while parity is asserted on several benchmarks, the manuscript does not report the exact LSH hyperparameters used per task or compare against a strong exact-attention baseline with identical optimization settings. Without these controls it is impossible to isolate whether observed differences (or lack thereof) are due to the approximation or to other implementation choices.
minor comments (2)
  1. [Section 3] Notation for the number of hash rounds and buckets is introduced without a compact summary table; adding one would improve readability when readers compare configurations across experiments.
  2. [Figure 2] Figure captions for the complexity and memory plots could explicitly state the sequence lengths at which the O(L log L) and single-activation regimes become advantageous.
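The ablation requested in major comment 1 would sweep exactly two knobs: bucket count and hash rounds. A toy probe (parameter names and values hypothetical, not the paper's configuration) shows why multiple rounds lower the miss rate: two vectors are attended jointly if any round puts them in the same bucket.

```python
import numpy as np

def _bucket(v, R):
    # Angular LSH: hash(v) = argmax([vR ; -vR]).
    p = v @ R
    return int(np.argmax(np.concatenate([p, -p])))

def any_round_collides(x, y, n_buckets=32, n_rounds=4, seed=0):
    """True if x and y share a bucket in at least one of n_rounds
    independent hash rounds. More rounds -> fewer missed neighbors,
    at proportionally higher compute cost."""
    rng = np.random.default_rng(seed)
    d = x.shape[-1]
    for _ in range(n_rounds):
        R = rng.standard_normal((d, n_buckets // 2))
        if _bucket(x, R) == _bucket(y, R):
            return True
    return False

rng = np.random.default_rng(1)
x = rng.standard_normal(64)
near = x + 0.01 * rng.standard_normal(64)   # near-duplicate of x
# A near-duplicate collides in some round with overwhelming probability;
# unrelated vectors collide far less often. Sweeping n_buckets and
# n_rounds is precisely the ablation the referee asks for.
```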

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance. We address each major comment below and commit to revisions that strengthen the empirical validation while noting the inherent challenges in providing tight theoretical bounds.

read point-by-point responses
  1. Referee: [Section 3 (LSH Attention)] LSH attention mechanism (Section 3): the headline claim of performance parity rests on the assertion that LSH with the reported bucket counts and hash rounds produces attention outputs sufficiently close to exact dot-product attention. No quantitative bound on approximation error, no analysis of collision probability under sparse or long-range attention patterns, and no ablation varying the number of rounds or buckets are provided; this leaves the weakest assumption untested and load-bearing for the central result.

    Authors: We acknowledge that the manuscript does not include a quantitative theoretical bound on the LSH approximation error or a dedicated analysis of collision probabilities for sparse/long-range patterns. Deriving tight, data-independent bounds is difficult because the error depends on the distribution of attention scores. The central results instead rest on consistent empirical parity across tasks. We will add an ablation study varying the number of hash rounds and buckets (and report the resulting performance and memory trade-offs) to the revised manuscript. We will also include a short discussion referencing standard LSH collision-probability results from the literature to clarify the probabilistic nature of the approximation. revision: partial

  2. Referee: [Section 4 (Experiments)] Experimental evaluation (Section 4 and Tables 1-3): while parity is asserted on several benchmarks, the manuscript does not report the exact LSH hyperparameters used per task or compare against a strong exact-attention baseline with identical optimization settings. Without these controls it is impossible to isolate whether observed differences (or lack thereof) are due to the approximation or to other implementation choices.

    Authors: We will add a supplementary table listing the precise LSH hyperparameters (buckets, hash rounds, chunk size, etc.) used for every reported experiment. Regarding baselines, standard Transformer comparisons were performed with matching optimizer, learning-rate schedule, and initialization wherever memory allowed. For sequences longer than a few thousand tokens, exact attention becomes infeasible on the same hardware, which is the core motivation of the work. We will add controlled side-by-side runs on shorter sequences (lengths where exact attention fits) using identical optimization settings to isolate the effect of the LSH approximation. revision: yes

Circularity Check

0 steps flagged

No circularity: Reformer introduces independent algorithmic mechanisms

full rationale

The paper's core contributions are two explicit algorithmic replacements: LSH-based attention (reducing complexity from O(L²) to O(L log L)) and reversible residual layers (storing activations once instead of N times). Neither is derived from fitted parameters, self-referential equations, or load-bearing self-citations; both are presented as new constructions whose correctness is evaluated via standard benchmark comparisons. The LSH approximation and reversibility are defined directly in the text without reducing to prior outputs of the same paper. No self-definitional loops, renamed empirical patterns, or uniqueness theorems imported from the authors' own prior work appear in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on the standard mathematical properties of locality-sensitive hashing for approximate nearest-neighbor search and on the invertibility property of reversible residual blocks; no new free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption Locality-sensitive hashing can be used to approximate dot-product attention with high probability
    Invoked to justify replacing exact attention with the LSH variant
  • standard math Reversible residual layers allow exact reconstruction of activations from later layers
    Used to claim memory savings without loss of information
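The second axiom can be checked directly on a toy reversible block of the form y1 = x1 + F(x2), y2 = x2 + G(y1) (as in Gomez et al.); F and G below are arbitrary stand-ins for the attention and feed-forward sublayers.

```python
import numpy as np

def rev_forward(x1, x2, F, G):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2, F, G):
    # Inputs are reconstructed from the outputs alone, which is why a
    # reversible network need not store per-layer activations.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal(8), rng.standard_normal(8)
F = lambda v: np.tanh(v)          # stand-in for the attention sublayer
G = lambda v: np.maximum(v, 0.0)  # stand-in for the feed-forward sublayer

y1, y2 = rev_forward(x1, x2, F, G)
r1, r2 = rev_inverse(y1, y2, F, G)
# r1, r2 recover x1, x2 up to floating-point rounding.
```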

pith-pipeline@v0.9.0 · 5425 in / 1208 out tokens · 38996 ms · 2026-05-13T07:14:15.536184+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 27 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TENNOR: Trustworthy Execution for Neural Networks through Obliviousness and Retrievals

    cs.CR 2026-05 unverdicted novelty 7.0

    TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.

  2. Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling

    cs.LG 2026-04 unverdicted novelty 7.0

    Stream-CQSA uses CQS-based decomposition to stream exact attention computations for billion-token sequences on limited-memory hardware.

  3. Cross-Stage Attention Propagation for Efficient Semantic Segmentation

    cs.CV 2026-04 unverdicted novelty 7.0

    CSAP computes attention at the deepest scale and propagates the maps to shallower stages, bypassing per-scale query-key computations to cut decoder FLOPs while preserving multi-scale performance and beating SegNeXt-Ti...

  4. Fast Cross-Operator Optimization of Attention Dataflow

    cs.AR 2026-04 unverdicted novelty 7.0

    MMEE encodes dataflow decisions in matrix form for fast exhaustive search, delivering 40-69% lower latency and energy use than prior methods while running 64-343x faster.

  5. LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

    cs.CL 2024-10 unverdicted novelty 7.0

    LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.

  6. RWKV: Reinventing RNNs for the Transformer Era

    cs.CL 2023-05 unverdicted novelty 7.0

    RWKV uses a linear attention mechanism to deliver Transformer-level performance with RNN-style inference efficiency, demonstrated at up to 14 billion parameters.

  7. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    cs.LG 2021-01 accept novelty 7.0

    Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.

  8. Elastic Attention Cores for Scalable Vision Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...

  9. Spectral Transformer Neural Processes

    cs.LG 2026-05 unverdicted novelty 6.0

    STNPs extend TNPs with a spectral aggregator that estimates context spectra, forms spectral mixtures, and injects task-adaptive frequency features to better handle periodicity.

  10. The Impossibility Triangle of Long-Context Modeling

    cs.CL 2026-05 unverdicted novelty 6.0

    No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.

  11. HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models

    cs.LG 2026-04 unverdicted novelty 6.0

    HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.

  12. DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing

    cs.CL 2026-04 unverdicted novelty 6.0

    DASH-KV accelerates long-context LLM inference to linear complexity via asymmetric KV cache hashing and mixed-precision retention, matching full attention performance on LongBench.

  13. GAMMA-Net: Adaptive Long-Horizon Traffic Spatio-Temporal Forecasting Model based on Interleaved Graph Attention and Multi-Axis Mamba

    cs.AI 2026-04 unverdicted novelty 6.0

    GAMMA-Net combines Graph Attention Networks and multi-axis Mamba to outperform prior models in long-horizon traffic forecasting, with up to 16.25% lower MAE on benchmarks like METR-LA and PEMS datasets.

  14. Kimi Linear: An Expressive, Efficient Attention Architecture

    cs.CL 2025-10 unverdicted novelty 6.0

    Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

  15. MemGPT: Towards LLMs as Operating Systems

    cs.AI 2023-10 unverdicted novelty 6.0

    MemGPT uses OS-inspired virtual context management to extend LLM context windows for large document analysis and long-term multi-session chat.

  16. PaLM: Scaling Language Modeling with Pathways

    cs.CL 2022-04 accept novelty 6.0

    PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

  17. ST-MoE: Designing Stable and Transferable Sparse Expert Models

    cs.CL 2022-02 unverdicted novelty 6.0

    ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...

  18. HuggingFace's Transformers: State-of-the-art Natural Language Processing

    cs.CL 2019-10 accept novelty 6.0

    Hugging Face releases an open-source Python library that supplies a unified API and pretrained weights for major Transformer architectures used in natural language processing.

  19. Kaczmarz Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...

  20. StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

    cs.LG 2026-05 accept novelty 5.0

    Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

  21. Learning Fingerprints for Medical Time Series with Redundancy-Constrained Information Maximization

    cs.LG 2026-04 unverdicted novelty 5.0

    A self-supervised method learns a fixed set of disentangled fingerprint tokens from medical time series by combining reconstruction loss with a total coding rate diversity penalty, framed as a disentangled rate-distor...

  22. Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models

    cs.LG 2026-04 unverdicted novelty 5.0

    Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.

  23. climt-paraformer: Stable Emulation of Convective Parameterization using a Temporal Memory-aware Transformer

    physics.ao-ph 2026-04 unverdicted novelty 5.0

    A temporal memory-aware Transformer emulator for the Emanuel convective parameterization shows lower offline errors and 10-year stability in single-column model tests compared to memory-less MLP and LSTM baselines.

  24. PestVL-Net: Enabling Multimodal Pest Learning via Fine-grained Vision-Language Interaction

    cs.CV 2026-04 unverdicted novelty 5.0

    PestVL-Net combines an RWKV visual backbone with saliency-guided window partitioning and MLLM-derived linguistic priors via multimodal chain-of-thought to enable fine-grained multimodal pest recognition on dedicated datasets.

  25. MedMamba: Recasting Mamba for Medical Time Series Classification

    eess.SP 2026-04 unverdicted novelty 5.0

    MedMamba introduces a principle-guided bidirectional multi-scale Mamba model that outperforms prior methods on EEG, ECG, and activity classification benchmarks while delivering 4.6x inference speedup.

  26. Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference

    cs.DC 2026-03 unverdicted novelty 5.0

    Unifying LLM memory optimizations into a Prepare-Compute-Retrieve-Apply pipeline and accelerating it on GPU-FPGA hardware yields up to 2.2x faster inference and 4.7x less energy than GPU-only baselines.

  27. RAFNet: Region-Aware Fusion Network for Pansharpening

    cs.CV 2026-05 unverdicted novelty 4.0

    RAFNet uses wavelet-based directional separation, K-means regional clustering, and clustered sparse attention to create adaptive kernels and efficient frequency aggregation, outperforming prior pansharpening networks ...

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · cited by 27 Pith papers · 3 internal anchors

  1. [2] Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya P. Razenshteyn, and Ludwig Schmidt. Practical and optimal LSH for angular distance. CoRR, abs/1509.02897, 2015. URL http://arxiv.org/abs/1509.02897.

  2. [3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016. URL http://arxiv.org/abs/1607.06450.

  3. [4] Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. Large-scale simple question answering with memory networks. CoRR, abs/1506.02075, 2015. URL http://arxiv.org/abs/1506.02075.

  4. [5] Sarath Chandar, Sungjin Ahn, Hugo Larochelle, Pascal Vincent, Gerald Tesauro, and Yoshua Bengio. Hierarchical memory networks. arXiv preprint arXiv:1605.07427, 2016.

  5. [7] Aidan N. Gomez, Mengye Ren, Raquel Urtasun, and Roger B. Grosse. The reversible residual network: Backpropagation without storing activations. In Advances in Neural Information Processing Systems, pp. 2214–2224, 2017.

  6. [9] Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, and Douglas Eck. Music Transformer: Generating music with long-term structure. arXiv preprint arXiv:1809.04281, 2018.

  7. [11] Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating Wikipedia by summarizing long sequences. CoRR, abs/1801.10198, 2018. URL http://arxiv.org/abs/1801.10198.

  8. [12] Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 1–9, Brussels, Belgium, October 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-6301. URL https://www.aclweb.org/anthology/W18-6301.

  9. [13] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, and Alexander Ku. Image Transformer. CoRR, abs/1802.05751, 2018.

  10. [15] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy P. Lillicrap. One-shot learning with memory-augmented neural networks. CoRR, abs/1605.06065, 2016. URL http://arxiv.org/abs/1605.06065.

  11. [16] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. CoRR, abs/1804.04235, 2018. URL http://arxiv.org/abs/1804.04235.

  12. [17] Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, and Blake Hechtman. Mesh-TensorFlow: Deep learning for supercomputers. CoRR, abs/1811.02084, 2018. URL http://arxiv.org/abs/1811.02084.

  13. [18] Nimit Sharad Sohoni, Christopher Richard Aberger, Megan Leszczynski, Jian Zhang, and Christopher Ré. Low-memory neural network training: A technical report. CoRR, abs/1904.10631, 2019. URL http://arxiv.org/abs/1904.10631.

  14. [19] Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. Adaptive attention span in Transformers. CoRR, abs/1905.07799, 2019. URL http://arxiv.org/abs/1905.07799.

  15. [20] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017. URL http://arxiv.org/abs/1706.03762.

  16. [21] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. CoRR, abs/1410.3916, 2014. URL http://arxiv.org/abs/1410.3916.