Recognition: 2 theorem links · Lean Theorem
Reformer: The Efficient Transformer
Pith reviewed 2026-05-13 07:14 UTC · model grok-4.3
The pith
The Reformer matches standard Transformer performance on long sequences while using much less memory and running faster.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Reformer achieves performance on par with Transformer models through two changes: locality-sensitive hashing attention, which reduces complexity from O(L²) to O(L log L), and reversible residual layers, which allow activations to be stored only once instead of N times (where N is the number of layers). The result is a model that is far more memory-efficient and faster on long sequences.
What carries the argument
Locality-sensitive hashing attention, which approximates exact attention at lower cost, combined with reversible residual layers, which eliminate the need to store separate activations for every layer.
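To make the mechanism concrete, here is a minimal NumPy sketch of angular-LSH bucketing followed by within-bucket attention, in the spirit of the paper. It assumes shared query/key vectors (as the paper does) and omits chunking, causal masking, and multi-round hashing; the function names and shapes are illustrative, not the reference implementation.

```python
import numpy as np

def lsh_bucket(x, n_buckets, rng):
    """Angular LSH: project onto random directions and take the
    argmax over the concatenation [xR; -xR], as described in the paper."""
    d = x.shape[-1]
    R = rng.standard_normal((d, n_buckets // 2))  # one shared random rotation
    rotated = x @ R                               # (n, n_buckets // 2)
    return np.argmax(np.concatenate([rotated, -rotated], axis=-1), axis=-1)

def lsh_attention(q, v, n_buckets, rng):
    """Softmax attention restricted to positions that hash to the same
    bucket (shared-QK attention, single hash round, no chunking)."""
    buckets = lsh_bucket(q, n_buckets, rng)
    out = np.zeros_like(v)
    for b in np.unique(buckets):
        idx = np.where(buckets == b)[0]
        scores = q[idx] @ q[idx].T / np.sqrt(q.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        out[idx] = (w / w.sum(axis=-1, keepdims=True)) @ v[idx]
    return out
```

Nearby vectors land in the same bucket with high probability, so each position attends over roughly L / n_buckets others rather than all L; the O(L log L) budget emerges once sorting and chunking are layered on top of this bucketing.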
If this is right
- The model can handle sequences much longer than those feasible with standard Transformers without exceeding available memory.
- Training time on long sequences decreases because attention computation drops from quadratic to linearithmic (see the arithmetic sketch after this list).
- Performance on typical sequence tasks stays comparable to that of exact-attention Transformers.
- Memory use for activations stays constant regardless of how many layers are stacked.
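The quadratic-to-linearithmic gap these points appeal to is easy to quantify. A back-of-envelope comparison, ignoring constant factors such as bucket size and number of hash rounds, so purely illustrative:

```python
import math

# Ratio of exact-attention work (L^2) to LSH-attention work (L * log2 L),
# with all constant factors dropped.
for L in (1_024, 16_384, 65_536):
    ratio = (L * L) / (L * math.log2(L))
    print(f"L = {L:>6}: asymptotic speedup ~ {ratio:,.0f}x")
```

At L = 65,536 the asymptotic ratio is about 4,096x, which is why the memory and wall-clock gains only become dramatic on long sequences.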
Where Pith is reading between the lines
- The same hashing idea could be applied to other quadratic operations inside Transformers to gain further speedups.
- Tasks that demand very fine-grained attention might require more hash buckets or rounds to keep quality close to the exact version.
- The reversible-layer trick could be paired with other memory-saving methods such as checkpointing to push sequence lengths even higher.
Load-bearing premise
The locality-sensitive hashing approximation to attention preserves enough accuracy that downstream task performance remains comparable to exact attention.
What would settle it
A side-by-side experiment on a long-sequence task against a standard Transformer of matched size, depth, and optimization settings: noticeably lower Reformer accuracy would show that the hashing approximation loses too much fidelity, while parity would confirm the load-bearing premise.
read the original abstract
Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O($L^2$) to O($L\log L$), where $L$ is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of $N$ times, where $N$ is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Reformer, a Transformer variant that replaces standard dot-product attention with locality-sensitive hashing (LSH) attention to reduce complexity from O(L²) to O(L log L) and uses reversible residual layers to store activations only once during training instead of N times. The central claim is that the resulting model achieves performance on par with standard Transformers while being substantially more memory-efficient and faster on long sequences.
Significance. If the empirical parity holds, the work is significant for enabling Transformer-scale models on longer sequences in domains such as long-document NLP, music, and genomics. The reversible-residual technique is exact and provides a clean memory reduction; the LSH approximation supplies the asymptotic speedup. Together they form a practical, reproducible recipe that has already influenced subsequent efficient-attention designs.
major comments (2)
- [Section 3 (LSH Attention)] The headline claim of performance parity rests on the assertion that LSH with the reported bucket counts and hash rounds produces attention outputs sufficiently close to exact dot-product attention. No quantitative bound on the approximation error, no analysis of collision probability under sparse or long-range attention patterns, and no ablation varying the number of rounds or buckets is provided; this leaves the weakest, load-bearing assumption for the central result untested (a minimal fidelity harness is sketched after this list).
- [Section 4 (Experiments)] While parity is asserted on several benchmarks (Tables 1-3), the manuscript does not report the exact LSH hyperparameters used per task or compare against a strong exact-attention baseline with identical optimization settings. Without these controls it is impossible to tell whether observed differences (or their absence) stem from the approximation or from other implementation choices.
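The ablation the first comment asks for can be prototyped cheaply before any training run. Below is a hedged NumPy harness (not from the paper) that measures how far multi-round LSH attention drifts from exact attention as the number of hash rounds varies; it builds a dense candidate mask, so it probes fidelity only, not speed.

```python
import numpy as np

def exact_attention(q, v):
    s = q @ q.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def lsh_mask_attention(q, v, n_buckets, n_rounds, rng):
    """Attend only over pairs that share a bucket in at least one of
    n_rounds independent hashes (dense mask: a fidelity check only)."""
    n, d = q.shape
    allowed = np.zeros((n, n), dtype=bool)
    for _ in range(n_rounds):
        R = rng.standard_normal((d, n_buckets // 2))
        r = q @ R
        b = np.argmax(np.concatenate([r, -r], axis=-1), axis=-1)
        allowed |= b[:, None] == b[None, :]   # union of candidate sets
    s = np.where(allowed, q @ q.T / np.sqrt(d), -np.inf)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((512, 64))
v = rng.standard_normal((512, 64))
ref = exact_attention(q, v)
for rounds in (1, 2, 4, 8):
    approx = lsh_mask_attention(q, v, n_buckets=16, n_rounds=rounds, rng=rng)
    err = np.linalg.norm(approx - ref) / np.linalg.norm(ref)
    print(f"rounds = {rounds}: relative error = {err:.3f}")
```

The error should fall (in expectation) as rounds increase, since the union of candidate sets grows toward the full attention matrix; sweeping n_buckets the same way would complete the ablation the referee requests.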
minor comments (2)
- [Section 3] Notation for the number of hash rounds and buckets is introduced without a compact summary table; adding one would improve readability when readers compare configurations across experiments.
- [Figure 2] Figure captions for the complexity and memory plots could explicitly state the sequence lengths at which the O(L log L) and single-activation regimes become advantageous.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the work's significance. We address each major comment below and commit to revisions that strengthen the empirical validation while noting the inherent challenges in providing tight theoretical bounds.
read point-by-point responses
-
Referee: [Section 3 (LSH Attention)] The headline claim of performance parity rests on the assertion that LSH with the reported bucket counts and hash rounds produces attention outputs sufficiently close to exact dot-product attention. No quantitative bound on the approximation error, no analysis of collision probability under sparse or long-range attention patterns, and no ablation varying the number of rounds or buckets is provided; this leaves the weakest, load-bearing assumption for the central result untested.
Authors: We acknowledge that the manuscript does not include a quantitative theoretical bound on the LSH approximation error or a dedicated analysis of collision probabilities for sparse/long-range patterns. Deriving tight, data-independent bounds is difficult because the error depends on the distribution of attention scores. The central results instead rest on consistent empirical parity across tasks. We will add an ablation study varying the number of hash rounds and buckets (and report the resulting performance and memory trade-offs) to the revised manuscript. We will also include a short discussion referencing standard LSH collision-probability results from the literature to clarify the probabilistic nature of the approximation. revision: partial
-
Referee: [Section 4 (Experiments)] While parity is asserted on several benchmarks (Tables 1-3), the manuscript does not report the exact LSH hyperparameters used per task or compare against a strong exact-attention baseline with identical optimization settings. Without these controls it is impossible to tell whether observed differences (or their absence) stem from the approximation or from other implementation choices.
Authors: We will add a supplementary table listing the precise LSH hyperparameters (buckets, hash rounds, chunk size, etc.) used for every reported experiment. Regarding baselines, standard Transformer comparisons were performed with matching optimizer, learning-rate schedule, and initialization wherever memory allowed. For sequences longer than a few thousand tokens, exact attention becomes infeasible on the same hardware, which is the core motivation of the work. We will add controlled side-by-side runs on shorter sequences (lengths where exact attention fits) using identical optimization settings to isolate the effect of the LSH approximation. revision: yes
Circularity Check
No circularity: Reformer introduces independent algorithmic mechanisms
full rationale
The paper's core contributions are two explicit algorithmic replacements: LSH-based attention (reducing complexity from O(L²) to O(L log L)) and reversible residual layers (storing activations once instead of N times). Neither is derived from fitted parameters, self-referential equations, or load-bearing self-citations; both are presented as new constructions whose correctness is evaluated via standard benchmark comparisons. The LSH approximation and reversibility are defined directly in the text without reducing to prior outputs of the same paper. No self-definitional loops, renamed empirical patterns, or uniqueness theorems imported from the authors' own prior work appear in the derivation chain.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Locality-sensitive hashing can be used to approximate dot-product attention with high probability
- standard math Reversible residual layers allow exact reconstruction of activations from later layers (verified in the sketch below)
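The second axiom is exact rather than probabilistic, and small enough to verify directly. A minimal sketch, with stand-in F and G in place of the attention and feed-forward sublayers:

```python
import numpy as np

def rev_forward(x1, x2, F, G):
    """Reversible residual block: y1 = x1 + F(x2), y2 = x2 + G(y1)."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2, F, G):
    """Recover the inputs exactly from the outputs alone, so no
    per-layer activations need to be stored for backpropagation."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal((4, 8)), rng.standard_normal((4, 8))
F = np.tanh                        # stand-in for the attention sublayer
G = lambda x: np.maximum(x, 0.0)   # stand-in for the feed-forward sublayer
y1, y2 = rev_forward(x1, x2, F, G)
r1, r2 = rev_inverse(y1, y2, F, G)
assert np.allclose(r1, x1) and np.allclose(r2, x2)
```

Because inversion is algebraic (subtracting in reverse order what the forward pass added), the reconstruction is exact up to floating-point roundoff, which is why activation memory stays constant in depth.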
Forward citations
Cited by 27 Pith papers
-
TENNOR: Trustworthy Execution for Neural Networks through Obliviousness and Retrievals
TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.
-
Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling
Stream-CQSA uses CQS-based decomposition to stream exact attention computations for billion-token sequences on limited-memory hardware.
-
Cross-Stage Attention Propagation for Efficient Semantic Segmentation
CSAP computes attention at the deepest scale and propagates the maps to shallower stages, bypassing per-scale query-key computations to cut decoder FLOPs while preserving multi-scale performance and beating SegNeXt-Ti...
-
Fast Cross-Operator Optimization of Attention Dataflow
MMEE encodes dataflow decisions in matrix form for fast exhaustive search, delivering 40-69% lower latency and energy use than prior methods while running 64-343x faster.
-
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory
LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.
-
RWKV: Reinventing RNNs for the Transformer Era
RWKV uses a linear attention mechanism to deliver Transformer-level performance with RNN-style inference efficiency, demonstrated at up to 14 billion parameters.
-
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
-
Elastic Attention Cores for Scalable Vision Transformers
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...
-
Spectral Transformer Neural Processes
STNPs extend TNPs with a spectral aggregator that estimates context spectra, forms spectral mixtures, and injects task-adaptive frequency features to better handle periodicity.
-
The Impossibility Triangle of Long-Context Modeling
No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.
-
HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models
HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.
-
DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing
DASH-KV accelerates long-context LLM inference to linear complexity via asymmetric KV cache hashing and mixed-precision retention, matching full attention performance on LongBench.
-
GAMMA-Net: Adaptive Long-Horizon Traffic Spatio-Temporal Forecasting Model based on Interleaved Graph Attention and Multi-Axis Mamba
GAMMA-Net combines Graph Attention Networks and multi-axis Mamba to outperform prior models in long-horizon traffic forecasting, with up to 16.25% lower MAE on benchmarks like METR-LA and PEMS datasets.
-
Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
-
MemGPT: Towards LLMs as Operating Systems
MemGPT uses OS-inspired virtual context management to extend LLM context windows for large document analysis and long-term multi-session chat.
-
PaLM: Scaling Language Modeling with Pathways
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
-
ST-MoE: Designing Stable and Transferable Sparse Expert Models
ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...
-
HuggingFace's Transformers: State-of-the-art Natural Language Processing
Hugging Face releases an open-source Python library that supplies a unified API and pretrained weights for major Transformer architectures used in natural language processing.
-
Kaczmarz Linear Attention
Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...
-
StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
-
Learning Fingerprints for Medical Time Series with Redundancy-Constrained Information Maximization
A self-supervised method learns a fixed set of disentangled fingerprint tokens from medical time series by combining reconstruction loss with a total coding rate diversity penalty, framed as a disentangled rate-distor...
-
Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models
Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.
-
climt-paraformer: Stable Emulation of Convective Parameterization using a Temporal Memory-aware Transformer
A temporal memory-aware Transformer emulator for the Emanuel convective parameterization shows lower offline errors and 10-year stability in single-column model tests compared to memory-less MLP and LSTM baselines.
-
PestVL-Net: Enabling Multimodal Pest Learning via Fine-grained Vision-Language Interaction
PestVL-Net combines an RWKV visual backbone with saliency-guided window partitioning and MLLM-derived linguistic priors via multimodal chain-of-thought to enable fine-grained multimodal pest recognition on dedicated datasets.
-
MedMamba: Recasting Mamba for Medical Time Series Classification
MedMamba introduces a principle-guided bidirectional multi-scale Mamba model that outperforms prior methods on EEG, ECG, and activity classification benchmarks while delivering 4.6x inference speedup.
-
Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference
Unifying LLM memory optimizations into a Prepare-Compute-Retrieve-Apply pipeline and accelerating it on GPU-FPGA hardware yields up to 2.2x faster inference and 4.7x less energy than GPU-only baselines.
-
RAFNet: Region-Aware Fusion Network for Pansharpening
RAFNet uses wavelet-based directional separation, K-means regional clustering, and clustered sparse attention to create adaptive kernels and efficient frequency aggregation, outperforming prior pansharpening networks ...
Reference graph
Works this paper leans on
- [2] Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya P. Razenshteyn, and Ludwig Schmidt. Practical and optimal LSH for angular distance. CoRR, abs/1509.02897. URL http://arxiv.org/abs/1509.02897.
- [3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450. URL http://arxiv.org/abs/1607.06450.
- [4] Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. Large-scale simple question answering with memory networks. CoRR, abs/1506.02075. URL http://arxiv.org/abs/1506.02075.
- [5] Sarath Chandar, Sungjin Ahn, Hugo Larochelle, Pascal Vincent, Gerald Tesauro, and Yoshua Bengio. Hierarchical memory networks. arXiv preprint arXiv:1605.07427.
- [6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional Transformers for language understanding. URL http://arxiv.org/abs/1810.04805.
- [7] Aidan N. Gomez, Mengye Ren, Raquel Urtasun, and Roger B. Grosse. The reversible residual network: Backpropagation without storing activations. In Advances in Neural Information Processing Systems, pp. 2214–2224.
- [9] Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, and Douglas Eck. Music Transformer: Generating music with long-term structure. arXiv preprint arXiv:1809.04281.
- [11] Generating Wikipedia by summarizing long sequences. URL http://arxiv.org/abs/1801.10198.
- [12] Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 1–9, Brussels, Belgium. doi: 10.18653/v1/W18-6301. URL https://www.aclweb.org/anthology/W18-6301.
- [13] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, and Alexander Ku. Image Transformer. CoRR, abs/1802.05751.
- [15] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy P. Lillicrap. One-shot learning with memory-augmented neural networks. CoRR, abs/1605.06065. URL http://arxiv.org/abs/1605.06065.
- [16] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. CoRR, abs/1804.04235. URL http://arxiv.org/abs/1804.04235.
- [17] Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, and Blake Hechtman. Mesh-TensorFlow: Deep learning for supercomputers. CoRR, abs/1811.02084. URL http://arxiv.org/abs/1811.02084.
- [18] Nimit Sharad Sohoni, Christopher Richard Aberger, Megan Leszczynski, Jian Zhang, and Christopher Ré. Low-memory neural network training: A technical report. CoRR, abs/1904.10631. URL http://arxiv.org/abs/1904.10631.
- [19] Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. Adaptive attention span in Transformers. CoRR, abs/1905.07799, 2019a. URL http://arxiv.org/abs/1905.07799. Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Hervé Jégou, and Armand Joulin. Augmenting self-attention with persistent memory.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. URL http://arxiv.org/abs/1706.03762.
- [20] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. CoRR, abs/1410.3916. URL http://arxiv.org/abs/1410.3916.