pith. machine review for the scientific record.

arxiv: 2605.09472 · v1 · submitted 2026-05-10 · 💻 cs.LG · cs.DS

Recognition: 2 theorem links · Lean Theorem

Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases

Daniel Wolfson, Tal Wagner

Pith reviewed 2026-05-12 04:24 UTC · model grok-4.3

classification 💻 cs.LG cs.DS
keywords: ALiBi · positional LSH · attention approximation · linear bias · block matrix · long-context transformers · locality-sensitive hashing · binary masks

The pith

The ALiBi bias matrix equals the expectation of random contiguous block-diagonal binary masks from a positional LSH scheme.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the ALiBi positional bias in attention arises exactly as the average of certain random binary matrices. These matrices come from a hashing procedure that creates block patterns along the sequence positions. Sampling and averaging a modest number of such matrices produces approximations with explicit error bounds that hold for every possible query, key, and value at once. This turns the quadratic cost of long-context ALiBi attention into a near-linear collection of short, ordinary attention steps. A reader would care because the same construction also supplies a single formal language that treats positional biases, masks, and embeddings together.
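
For reference, standard causal ALiBi (the original construction this review measures the paper's claim against; nothing in this block is new to the paper) adds a head-specific linear distance penalty to the pre-softmax scores:

\[
\mathrm{Attn}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}+B\right)V,
\qquad
B_{i,j}=\begin{cases}-m\,(i-j), & j\le i,\\ -\infty, & j>i,\end{cases}
\]

where m > 0 is the per-head slope.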

Core claim

The ALiBi bias matrix is the expectation of contiguous block-diagonal binary masks induced by a positional LSH scheme. The empirical mean of masks sampled from this scheme yields spectral-norm and max-norm approximation guarantees, with bounded block sizes, holding with high probability. This structural theorem implies a uniform approximation theorem for ALiBi-biased attention: with high probability over the sampled masks, the approximate attention output is accurate simultaneously for all query-key-value inputs and can be computed in near-linear time in the context length, reducing long-context ALiBi to a collection of randomized short-context regular attention operations.
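
In symbols, and only schematically (the paper's exact norms, constants, and failure probabilities are not reproduced here), the claim posits a distribution D over contiguous block-diagonal binary masks satisfying

\[
\mathbb{E}_{M\sim\mathcal{D}}[M]=B,
\qquad
\widehat{M}=\frac{1}{s}\sum_{t=1}^{s}M^{(t)},
\qquad
\Pr\!\left[\lVert\widehat{M}-B\rVert_{2}\le\varepsilon\ \text{and}\ \lVert\widehat{M}-B\rVert_{\max}\le\varepsilon\right]\ge 1-\delta,
\]

with all block sizes bounded with high probability, so that each sampled mask supports a short-context attention pass.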

What carries the argument

A positional LSH scheme that generates contiguous block-diagonal binary masks whose expectation is exactly the ALiBi bias matrix.
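
The mechanics are easiest to see in a toy version. A shifted grid with a uniformly random offset is the simplest positional hashing that yields contiguous block-diagonal masks, and its collision probability already decays linearly in the positional gap. A minimal sketch under that assumption (the sampler and names below are ours, not the paper's construction):

import numpy as np

def sample_block_mask(n, w, rng):
    # One contiguous block-diagonal 0/1 mask: positions 0..n-1 are cut
    # into width-w blocks at a uniformly random offset, and M[i, j] = 1
    # iff i and j land in the same block.
    u = rng.uniform(0.0, w)
    block_id = np.floor((np.arange(n) - u) / w).astype(int)
    return (block_id[:, None] == block_id[None, :]).astype(float)

rng = np.random.default_rng(0)
n, w, s = 256, 32, 2000
mean_mask = sum(sample_block_mask(n, w, rng) for _ in range(s)) / s

# For this sampler, E[M[i, j]] = max(0, 1 - |i - j| / w), so the
# empirical mean should sit close to that triangular kernel.
i, j = np.indices((n, n))
print(np.abs(mean_mask - np.maximum(0.0, 1.0 - np.abs(i - j) / w)).max())

The resulting expectation, max(0, 1 - |i - j|/w), is a linearly decaying kernel in the positional gap; whether a scheme of this kind can be arranged to recover the ALiBi bias exactly is what the paper's theorem, and the referee's objection below, turn on.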

If this is right

  • ALiBi attention becomes uniformly accurate for all inputs once enough random masks are averaged.
  • Total runtime scales near-linearly with context length rather than quadratically.
  • Each long sequence reduces to several independent short-context unbiased attention computations (see the sketch after this list).
  • Spectral and max-norm error bounds hold with high probability for any fixed block size.
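
To see why the reduction is near-linear, note that a block-diagonal mask factors the n × n attention into independent per-block attentions, so the work is the sum of squared block sizes. A minimal sketch of the single-mask step (illustrative only; the paper's algorithm additionally averages over sampled partitions and handles causality):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def block_attention(Q, K, V, block_id):
    # Attention under one block-diagonal mask: each position attends only
    # within its own block, so the n x n problem splits into independent
    # short attentions whose total cost is the sum of squared block sizes.
    d = Q.shape[-1]
    out = np.zeros_like(V)
    for b in np.unique(block_id):
        idx = np.flatnonzero(block_id == b)   # one contiguous block
        out[idx] = softmax(Q[idx] @ K[idx].T / np.sqrt(d)) @ V[idx]
    return out

rng = np.random.default_rng(1)
n, d, w = 512, 64, 32
Q, K, V = rng.normal(size=(3, n, d))
block_id = (np.arange(n) + rng.integers(w)) // w   # one sampled partition
out = block_attention(Q, K, V, block_id)

With s sampled partitions and blocks of width at most w, total work is O(s · n · w) rather than O(n²).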

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same hashing view might yield approximations for other positional bias families beyond ALiBi.
  • Practical code could combine this sampling with existing sparse-attention kernels to cut memory further.
  • Empirical tests on models of increasing size would show whether the high-probability bounds produce measurable wall-clock gains.

Load-bearing premise

A positional LSH scheme must exist whose induced masks have expectation exactly equal to the ALiBi bias matrix, and the random samples must approximate it uniformly for every query-key-value triple without input-dependent tuning.

What would settle it

Generate many masks from the positional LSH scheme, form their empirical mean, and test whether the spectral-norm distance to the true ALiBi matrix stays below the claimed bound as block size grows; alternatively, measure whether the attention output error for arbitrary query-key-value triples exceeds the predicted uniform bound.
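
A toy version of that test, using the shifted-grid sampler sketched earlier, whose expectation is known in closed form; for the paper's actual test, substitute its positional LSH sampler and its (suitably transformed) ALiBi target in place of target:

import numpy as np

def sample_mask(n, w, rng):
    # Toy shifted-grid sampler, as in the earlier sketch.
    u = rng.uniform(0.0, w)
    bid = np.floor((np.arange(n) - u) / w).astype(int)
    return (bid[:, None] == bid[None, :]).astype(float)

def concentration_test(n=128, w=16, sizes=(10, 100, 1000), seed=0):
    # Spectral-norm distance between the empirical mean of sampled masks
    # and its expectation, as a function of the sample size s.
    rng = np.random.default_rng(seed)
    i, j = np.indices((n, n))
    target = np.maximum(0.0, 1.0 - np.abs(i - j) / w)  # E[M] for this toy
    acc, done = np.zeros((n, n)), 0
    for s in sizes:
        for _ in range(s - done):
            acc += sample_mask(n, w, rng)
        done = s
        gap = np.linalg.norm(acc / s - target, 2)      # spectral norm
        print(f"s={s:5d}  ||mean - E[M]||_2 = {gap:.3f}")

concentration_test()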

Figures

Figures reproduced from arXiv: 2605.09472 by Daniel Wolfson, Tal Wagner.

Figure 1
Figure 1. ALiBi approximation with binary block-diagonal matrices by positional LSH. view at source ↗
Figure 2
Figure 2. ‖L⋆ − M̃‖, ‖L⋆ − M̃‖_max, and ‖T⋆ − T̃⋆‖_max as functions of the sample size s, for four attention heads with various values of σ, for the models Llama-4-Scout-17B-16E and Mistral-7B. Both axes are on a log scale. Shaded regions are mean ± std across inputs. view at source ↗
read the original abstract

Positional encoding in transformers is commonly implemented through positional embeddings, attention masks, or bias terms, but formal connections between these mechanisms remain limited. We study attention with positional bias through the lens of locality-sensitive hashing (LSH), focusing on Attention with Linear Biases (ALiBi). We show that the ALiBi bias matrix is the expectation of contiguous block-diagonal binary masks induced by a ``positional LSH'' scheme. The empirical mean of masks sampled from this scheme yields spectral norm and max-norm approximation guarantees with bounded block sizes with high probability. This structural theorem implies a uniform approximation theorem for ALiBi-biased attention: with high probability over the sampled masks, the approximate attention output is accurate simultaneously for all query-key-value inputs and can be computed in near-linear time in the context length, reducing long-context ALiBi to a collection of randomized short-context regular (positionally unbiased) attention operations. Conceptually, this connects positional bias, masks, and positional embeddings in a single formal framework and suggests an approach to efficient ALiBi-biased attention. Experiments on large language models validate our theoretical findings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that the ALiBi bias matrix equals the expectation of contiguous block-diagonal binary masks induced by a positional LSH scheme. Sampling from this scheme yields spectral-norm and max-norm approximation guarantees with bounded block sizes (with high probability). This implies a uniform approximation theorem for ALiBi-biased attention: with high probability the approximate output is accurate for all QKV inputs and can be computed in near-linear time by reducing long-context ALiBi to randomized short-context unbiased attention. Experiments on LLMs are said to validate the theory.

Significance. If the structural theorem and approximation guarantees hold, the work would establish a formal link between positional bias, masking, and LSH, offering a principled route to efficient long-context ALiBi attention without data-dependent tuning. The uniform-over-all-inputs guarantee and the reduction to standard attention would be notable contributions to the theory of approximate attention mechanisms.

major comments (2)
  1. [Abstract] The central claim that 'the ALiBi bias matrix is the expectation of contiguous block-diagonal binary masks' cannot hold literally. Standard ALiBi defines B_{i,j} = -m · |i-j| (negative and unbounded below), while any binary mask M satisfies 0 ≤ E[M_{i,j}] ≤ 1. No distribution over such masks can satisfy E[M] = B when B has entries outside [0,1]. All subsequent spectral/max-norm bounds, the uniform approximation theorem, and the near-linear-time reduction rest on this equality and are therefore unsupported as stated.
  2. [Abstract / central theorem] The weakest assumption listed in the manuscript (existence of a positional LSH scheme whose induced masks have expectation exactly equal to the ALiBi matrix) is impossible under the standard definition of ALiBi; any fix would require either a nonlinear transformation of the masks (e.g., log(E[M]) or hard masking inside softmax) or a rescaled non-negative version of the bias, neither of which is indicated in the abstract or the stated theorem.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and precise reading of the manuscript. The identified inconsistency in the central claim is a genuine oversight in our formulation, and we agree that the literal statement cannot hold. We will revise the abstract, central theorem, and related sections to correct this while retaining the core technical contributions on positional LSH, approximation bounds, and the reduction to short-context attention.

read point-by-point responses
  1. Referee: [Abstract] The central claim that 'the ALiBi bias matrix is the expectation of contiguous block-diagonal binary masks' cannot hold literally. Standard ALiBi defines B_{i,j} = -m · |i-j| (negative and unbounded below), while any binary mask M satisfies 0 ≤ E[M_{i,j}] ≤ 1. No distribution over such masks can satisfy E[M] = B when B has entries outside [0,1]. All subsequent spectral/max-norm bounds, the uniform approximation theorem, and the near-linear-time reduction rest on this equality and are therefore unsupported as stated.

    Authors: We agree that a direct equality E[M] = B is mathematically impossible given the range mismatch. This was an imprecise formulation in the abstract. The intended relationship is that the expectation of the positional LSH masks provides a non-negative matrix that, after an affine transformation (shift and scale to match the slope of ALiBi), approximates the bias structure. We will revise the abstract to state that the ALiBi bias is recovered from the expectation of the masks up to this transformation, and we will restate the spectral-norm and max-norm bounds as approximation guarantees between the (transformed) bias matrix and the empirical mean of the sampled masks. The uniform approximation theorem and near-linear-time reduction will be updated to apply to the transformed setting; the reduction to randomized short-context unbiased attention remains valid because the masks are still contiguous block-diagonal. We will also add a short section clarifying the transformation. revision: yes

  2. Referee: [Abstract / central theorem] The weakest assumption listed in the manuscript (existence of a positional LSH scheme whose induced masks have expectation exactly equal to the ALiBi matrix) is impossible under the standard definition of ALiBi; any fix would require either a nonlinear transformation of the masks (e.g., log(E[M]) or hard masking inside softmax) or a rescaled non-negative version of the bias, neither of which is indicated in the abstract or the stated theorem.

    Authors: The referee is correct that the weakest assumption cannot hold literally. We will revise the central theorem to adopt a rescaled non-negative version of the ALiBi bias (mapping the negative linear bias to [0,1] via a simple shift-and-scale) so that the expectation equality becomes feasible. Alternatively, we will present the result using log(E[M]) inside the attention score to recover the original negative bias; we prefer the rescaled version for simplicity and will indicate this choice explicitly in the revised abstract and theorem statement. The approximation guarantees, uniform-over-inputs property, and complexity reduction carry over directly to the revised setting with only minor adjustments to constants. The LLM experiments will be re-described as validating the corrected theorem. revision: yes
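
One way to make the rebuttal's log E[M] route concrete (our inference from the rebuttal, not a statement in the paper): binary masks act multiplicatively on the unnormalized attention weights, so by linearity of expectation,

\[
\mathbb{E}_{M}\!\left[e^{S_{ij}}M_{ij}\right]=e^{S_{ij}}\,\mathbb{E}[M_{ij}]=e^{S_{ij}+\log\mathbb{E}[M_{ij}]},
\qquad
e^{B_{ij}}=e^{-m(i-j)}\in(0,1]\ \text{for}\ j\le i,
\]

so exp(B), unlike B itself, has the range of a mask-inclusion probability, and B = log E[M] is recoverable wherever E[M_{ij}] > 0, with hard masking standing in for the −∞ entries.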

Circularity Check

1 step flagged

The central structural theorem reduces to a definition: it holds by construction of the positional LSH scheme

specific steps
  1. self-definitional [Abstract]
    "We show that the ALiBi bias matrix is the expectation of contiguous block-diagonal binary masks induced by a ``positional LSH'' scheme."

    The positional LSH scheme is introduced precisely so that its induced masks have expectation equal to the ALiBi bias matrix. The 'show that' statement is therefore true by the definition of the scheme rather than a derived result from independent properties of LSH or ALiBi.

full rationale

The paper's core claim is that the ALiBi bias matrix equals the expectation of binary masks from a newly introduced 'positional LSH' scheme. This equality holds by the way the scheme is defined (to induce exactly that expectation), rather than emerging as an independent derivation. Subsequent approximation bounds and uniform attention guarantees then rest on this constructed equality plus standard concentration inequalities. The range mismatch between negative unbounded ALiBi entries and [0,1]-valued mask expectations further indicates the claimed literal equality cannot hold without additional (unquoted) rescaling or nonlinear mapping, but the circularity is already present in the self-definitional construction itself.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the existence of a positional LSH whose masks average exactly to the ALiBi matrix; this is an ad-hoc construction introduced for the paper rather than derived from prior independent results.

free parameters (1)
  • block size bound
    The approximation guarantees are stated for bounded block sizes, which functions as a tunable parameter controlling the trade-off between approximation quality and runtime.
axioms (1)
  • ad hoc to paper: There exists a positional LSH scheme such that the expectation of its contiguous block-diagonal binary masks equals the ALiBi bias matrix.
    This identity is the load-bearing structural theorem asserted in the abstract.
invented entities (1)
  • positional LSH (no independent evidence)
    purpose: Generates random binary masks whose expectation recovers the ALiBi positional bias.
    Newly defined hashing scheme tailored to positions; no independent evidence outside the paper is supplied.

pith-pipeline@v0.9.0 · 5494 in / 1465 out tokens · 65347 ms · 2026-05-12T04:24:37.658366+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.
