pith. machine review for the scientific record.

arxiv: 2605.09472 · v1 · submitted 2026-05-10 · 💻 cs.LG · cs.DS

Recognition: 2 theorem links · Lean Theorem

Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases

Daniel Wolfson, Tal Wagner

Pith reviewed 2026-05-12 04:24 UTC · model grok-4.3

classification 💻 cs.LG cs.DS
keywords: ALiBi · positional LSH · attention approximation · linear bias · block matrix · long-context transformers · locality-sensitive hashing · binary masks

The pith

The ALiBi bias matrix equals the expectation of random contiguous block-diagonal binary masks from a positional LSH scheme.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the ALiBi positional bias in attention arises exactly as the average of certain random binary matrices. These matrices come from a hashing procedure that creates block patterns along the sequence positions. Sampling and averaging a modest number of such matrices produces approximations with explicit error bounds that hold for every possible query, key, and value at once. This turns the quadratic cost of long-context ALiBi attention into a near-linear collection of short, ordinary attention steps. A reader would care because the same construction also supplies a single formal language that treats positional biases, masks, and embeddings together.
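
For reference, standard causal ALiBi (the original construction this review measures the paper's claim against; nothing in this block is new to the paper) adds a head-specific linear distance penalty to the pre-softmax scores:

\[
\mathrm{Attn}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}+B\right)V,
\qquad
B_{i,j}=\begin{cases}-m\,(i-j), & j\le i,\\ -\infty, & j>i,\end{cases}
\]

where m > 0 is the per-head slope.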

Core claim

The ALiBi bias matrix is the expectation of contiguous block-diagonal binary masks induced by a positional LSH scheme. The empirical mean of masks sampled from this scheme yields spectral-norm and max-norm approximation guarantees, with bounded block sizes, holding with high probability. This structural theorem implies a uniform approximation theorem for ALiBi-biased attention: with high probability over the sampled masks, the approximate attention output is accurate simultaneously for all query-key-value inputs and can be computed in near-linear time in the context length, reducing long-context ALiBi to a collection of randomized short-context regular attention operations.
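
In symbols, and only schematically (the paper's exact norms, constants, and failure probabilities are not reproduced here), the claim posits a distribution D over contiguous block-diagonal binary masks satisfying

\[
\mathbb{E}_{M\sim\mathcal{D}}[M]=B,
\qquad
\widehat{M}=\frac{1}{s}\sum_{t=1}^{s}M^{(t)},
\qquad
\Pr\!\left[\lVert\widehat{M}-B\rVert_{2}\le\varepsilon\ \text{and}\ \lVert\widehat{M}-B\rVert_{\max}\le\varepsilon\right]\ge 1-\delta,
\]

with all block sizes bounded with high probability, so that each sampled mask supports a short-context attention pass.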

What carries the argument

A positional LSH scheme that generates contiguous block-diagonal binary masks whose expectation is exactly the ALiBi bias matrix.
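
The mechanics are easiest to see in a toy version. A shifted grid with a uniformly random offset is the simplest positional hashing that yields contiguous block-diagonal masks, and its collision probability already decays linearly in the positional gap. A minimal sketch under that assumption (the sampler and names below are ours, not the paper's construction):

import numpy as np

def sample_block_mask(n, w, rng):
    # One contiguous block-diagonal 0/1 mask: positions 0..n-1 are cut
    # into width-w blocks at a uniformly random offset, and M[i, j] = 1
    # iff i and j land in the same block.
    u = rng.uniform(0.0, w)
    block_id = np.floor((np.arange(n) - u) / w).astype(int)
    return (block_id[:, None] == block_id[None, :]).astype(float)

rng = np.random.default_rng(0)
n, w, s = 256, 32, 2000
mean_mask = sum(sample_block_mask(n, w, rng) for _ in range(s)) / s

# For this sampler, E[M[i, j]] = max(0, 1 - |i - j| / w), so the
# empirical mean should sit close to that triangular kernel.
i, j = np.indices((n, n))
print(np.abs(mean_mask - np.maximum(0.0, 1.0 - np.abs(i - j) / w)).max())

The resulting expectation, max(0, 1 - |i - j|/w), is a linearly decaying kernel in the positional gap; whether a scheme of this kind can be arranged to recover the ALiBi bias exactly is what the paper's theorem, and the referee's objection below, turn on.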

If this is right

  • ALiBi attention becomes uniformly accurate for all inputs once enough random masks are averaged.
  • Total runtime scales near-linearly with context length rather than quadratically.
  • Each long sequence reduces to several independent short-context unbiased attention computations (see the sketch after this list).
  • Spectral and max-norm error bounds hold with high probability for any fixed block size.
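
To see why the reduction is near-linear, note that a block-diagonal mask factors the n × n attention into independent per-block attentions, so the work is the sum of squared block sizes. A minimal sketch of the single-mask step (illustrative only; the paper's algorithm additionally averages over sampled partitions and handles causality):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def block_attention(Q, K, V, block_id):
    # Attention under one block-diagonal mask: each position attends only
    # within its own block, so the n x n problem splits into independent
    # short attentions whose total cost is the sum of squared block sizes.
    d = Q.shape[-1]
    out = np.zeros_like(V)
    for b in np.unique(block_id):
        idx = np.flatnonzero(block_id == b)   # one contiguous block
        out[idx] = softmax(Q[idx] @ K[idx].T / np.sqrt(d)) @ V[idx]
    return out

rng = np.random.default_rng(1)
n, d, w = 512, 64, 32
Q, K, V = rng.normal(size=(3, n, d))
block_id = (np.arange(n) + rng.integers(w)) // w   # one sampled partition
out = block_attention(Q, K, V, block_id)

With s sampled partitions and blocks of width at most w, total work is O(s · n · w) rather than O(n²).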

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same hashing view might yield approximations for other positional bias families beyond ALiBi.
  • Practical code could combine this sampling with existing sparse-attention kernels to cut memory further.
  • Empirical tests on models of increasing size would show whether the high-probability bounds produce measurable wall-clock gains.

Load-bearing premise

A positional LSH scheme must exist whose induced masks have expectation exactly equal to the ALiBi bias matrix, and the random samples must approximate it uniformly for every query-key-value triple without input-dependent tuning.

What would settle it

Generate many masks from the positional LSH scheme, form their empirical mean, and test whether the spectral-norm distance to the true ALiBi matrix stays below the claimed bound as block size grows; alternatively, measure whether the attention output error for arbitrary query-key-value triples exceeds the predicted uniform bound.
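
A toy version of that test, using the shifted-grid sampler sketched earlier, whose expectation is known in closed form; for the paper's actual test, substitute its positional LSH sampler and its (suitably transformed) ALiBi target in place of target:

import numpy as np

def sample_mask(n, w, rng):
    # Toy shifted-grid sampler, as in the earlier sketch.
    u = rng.uniform(0.0, w)
    bid = np.floor((np.arange(n) - u) / w).astype(int)
    return (bid[:, None] == bid[None, :]).astype(float)

def concentration_test(n=128, w=16, sizes=(10, 100, 1000), seed=0):
    # Spectral-norm distance between the empirical mean of sampled masks
    # and its expectation, as a function of the sample size s.
    rng = np.random.default_rng(seed)
    i, j = np.indices((n, n))
    target = np.maximum(0.0, 1.0 - np.abs(i - j) / w)  # E[M] for this toy
    acc, done = np.zeros((n, n)), 0
    for s in sizes:
        for _ in range(s - done):
            acc += sample_mask(n, w, rng)
        done = s
        gap = np.linalg.norm(acc / s - target, 2)      # spectral norm
        print(f"s={s:5d}  ||mean - E[M]||_2 = {gap:.3f}")

concentration_test()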

Figures

Figures reproduced from arXiv: 2605.09472 by Daniel Wolfson, Tal Wagner.

Figure 1
Figure 1. ALiBi approximation with binary block-diagonal matrices by positional LSH. view at source ↗
Figure 2
Figure 2. ‖L⋆ − M̃‖, ‖L⋆ − M̃‖_max, and ‖T⋆ − T̃⋆‖_max as functions of the sample size s, for four attention heads with various values of σ, for the models Llama-4-Scout-17B-16E and Mistral-7B. Both axes are on a log scale. Shaded regions are mean ± std across inputs. view at source ↗
read the original abstract

Positional encoding in transformers is commonly implemented through positional embeddings, attention masks, or bias terms, but formal connections between these mechanisms remain limited. We study attention with positional bias through the lens of locality-sensitive hashing (LSH), focusing on Attention with Linear Biases (ALiBi). We show that the ALiBi bias matrix is the expectation of contiguous block-diagonal binary masks induced by a ``positional LSH'' scheme. The empirical mean of masks sampled from this scheme yields spectral norm and max-norm approximation guarantees with bounded block sizes with high probability. This structural theorem implies a uniform approximation theorem for ALiBi-biased attention: with high probability over the sampled masks, the approximate attention output is accurate simultaneously for all query-key-value inputs and can be computed in near-linear time in the context length, reducing long-context ALiBi to a collection of randomized short-context regular (positionally unbiased) attention operations. Conceptually, this connects positional bias, masks, and positional embeddings in a single formal framework and suggests an approach to efficient ALiBi-biased attention. Experiments on large language models validate our theoretical findings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that the ALiBi bias matrix equals the expectation of contiguous block-diagonal binary masks induced by a positional LSH scheme. Sampling from this scheme yields spectral-norm and max-norm approximation guarantees with bounded block sizes (with high probability). This implies a uniform approximation theorem for ALiBi-biased attention: with high probability the approximate output is accurate for all QKV inputs and can be computed in near-linear time by reducing long-context ALiBi to randomized short-context unbiased attention. Experiments on LLMs are said to validate the theory.

Significance. If the structural theorem and approximation guarantees hold, the work would establish a formal link between positional bias, masking, and LSH, offering a principled route to efficient long-context ALiBi attention without data-dependent tuning. The uniform-over-all-inputs guarantee and the reduction to standard attention would be notable contributions to the theory of approximate attention mechanisms.

major comments (2)
  1. [Abstract] The central claim that 'the ALiBi bias matrix is the expectation of contiguous block-diagonal binary masks' cannot hold literally. Standard ALiBi defines B_{i,j} = -m · |i-j| (negative and unbounded below), while any binary mask M satisfies 0 ≤ E[M_{i,j}] ≤ 1. No distribution over such masks can satisfy E[M] = B when B has entries outside [0,1]. All subsequent spectral/max-norm bounds, the uniform approximation theorem, and the near-linear-time reduction rest on this equality and are therefore unsupported as stated.
  2. [Abstract / central theorem] The weakest assumption listed in the manuscript (existence of a positional LSH scheme whose induced masks have expectation exactly equal to the ALiBi matrix) is impossible under the standard definition of ALiBi; any fix would require either a nonlinear transformation of the masks (e.g., log(E[M]) or hard masking inside softmax) or a rescaled non-negative version of the bias, neither of which is indicated in the abstract or the stated theorem.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and precise reading of the manuscript. The identified inconsistency in the central claim is a genuine oversight in our formulation, and we agree that the literal statement cannot hold. We will revise the abstract, central theorem, and related sections to correct this while retaining the core technical contributions on positional LSH, approximation bounds, and the reduction to short-context attention.

read point-by-point responses
  1. Referee: [Abstract] The central claim that 'the ALiBi bias matrix is the expectation of contiguous block-diagonal binary masks' cannot hold literally. Standard ALiBi defines B_{i,j} = -m · |i-j| (negative and unbounded below), while any binary mask M satisfies 0 ≤ E[M_{i,j}] ≤ 1. No distribution over such masks can satisfy E[M] = B when B has entries outside [0,1]. All subsequent spectral/max-norm bounds, the uniform approximation theorem, and the near-linear-time reduction rest on this equality and are therefore unsupported as stated.

    Authors: We agree that a direct equality E[M] = B is mathematically impossible given the range mismatch. This was an imprecise formulation in the abstract. The intended relationship is that the expectation of the positional LSH masks provides a non-negative matrix that, after an affine transformation (shift and scale to match the slope of ALiBi), approximates the bias structure. We will revise the abstract to state that the ALiBi bias is recovered from the expectation of the masks up to this transformation, and we will restate the spectral-norm and max-norm bounds as approximation guarantees between the (transformed) bias matrix and the empirical mean of the sampled masks. The uniform approximation theorem and near-linear-time reduction will be updated to apply to the transformed setting; the reduction to randomized short-context unbiased attention remains valid because the masks are still contiguous block-diagonal. We will also add a short section clarifying the transformation. revision: yes

  2. Referee: [Abstract / central theorem] The weakest assumption listed in the manuscript (existence of a positional LSH scheme whose induced masks have expectation exactly equal to the ALiBi matrix) is impossible under the standard definition of ALiBi; any fix would require either a nonlinear transformation of the masks (e.g., log(E[M]) or hard masking inside softmax) or a rescaled non-negative version of the bias, neither of which is indicated in the abstract or the stated theorem.

    Authors: The referee is correct that the weakest assumption cannot hold literally. We will revise the central theorem to adopt a rescaled non-negative version of the ALiBi bias (mapping the negative linear bias to [0,1] via a simple shift-and-scale) so that the expectation equality becomes feasible. Alternatively, we will present the result using log(E[M]) inside the attention score to recover the original negative bias; we prefer the rescaled version for simplicity and will indicate this choice explicitly in the revised abstract and theorem statement. The approximation guarantees, uniform-over-inputs property, and complexity reduction carry over directly to the revised setting with only minor adjustments to constants. The LLM experiments will be re-described as validating the corrected theorem. revision: yes
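
One way to make the rebuttal's log E[M] route concrete (our inference from the rebuttal, not a statement in the paper): binary masks act multiplicatively on the unnormalized attention weights, so by linearity of expectation,

\[
\mathbb{E}_{M}\!\left[e^{S_{ij}}M_{ij}\right]=e^{S_{ij}}\,\mathbb{E}[M_{ij}]=e^{S_{ij}+\log\mathbb{E}[M_{ij}]},
\qquad
e^{B_{ij}}=e^{-m(i-j)}\in(0,1]\ \text{for}\ j\le i,
\]

so exp(B), unlike B itself, has the range of a mask-inclusion probability, and B = log E[M] is recoverable wherever E[M_{ij}] > 0, with hard masking standing in for the −∞ entries.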

Circularity Check

1 step flagged

The central structural theorem reduces to a definition: it holds by construction of the positional LSH scheme

specific steps
  1. self-definitional [Abstract]
    "We show that the ALiBi bias matrix is the expectation of contiguous block-diagonal binary masks induced by a ``positional LSH'' scheme."

    The positional LSH scheme is introduced precisely so that its induced masks have expectation equal to the ALiBi bias matrix. The 'show that' statement is therefore true by the definition of the scheme rather than a derived result from independent properties of LSH or ALiBi.

full rationale

The paper's core claim is that the ALiBi bias matrix equals the expectation of binary masks from a newly introduced 'positional LSH' scheme. This equality holds by the way the scheme is defined (to induce exactly that expectation), rather than emerging as an independent derivation. Subsequent approximation bounds and uniform attention guarantees then rest on this constructed equality plus standard concentration inequalities. The range mismatch between negative unbounded ALiBi entries and [0,1]-valued mask expectations further indicates the claimed literal equality cannot hold without additional (unquoted) rescaling or nonlinear mapping, but the circularity is already present in the self-definitional construction itself.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the existence of a positional LSH whose masks average exactly to the ALiBi matrix; this is an ad-hoc construction introduced for the paper rather than derived from prior independent results.

free parameters (1)
  • block size bound
    The approximation guarantees are stated for bounded block sizes, which functions as a tunable parameter controlling the trade-off between approximation quality and runtime.
axioms (1)
  • ad hoc to paper: There exists a positional LSH scheme such that the expectation of its contiguous block-diagonal binary masks equals the ALiBi bias matrix.
    This identity is the load-bearing structural theorem asserted in the abstract.
invented entities (1)
  • positional LSH (no independent evidence)
    purpose: Generates random binary masks whose expectation recovers the ALiBi positional bias.
    Newly defined hashing scheme tailored to positions; no independent evidence outside the paper is supplied.

pith-pipeline@v0.9.0 · 5494 in / 1465 out tokens · 65347 ms · 2026-05-12T04:24:37.658366+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.
