pith. machine review for the scientific record. sign in

arxiv: 2605.11921 · v1 · submitted 2026-05-12 · 💻 cs.DS · cs.IR

Recognition: no theorem link

On the LSH Distortion of Ulam and Cayley Similarities

Authors on Pith no claims yet

Pith reviewed 2026-05-13 04:49 UTC · model grok-4.3

classification 💻 cs.DS cs.IR
keywords LSH distortionUlam similarityCayley similaritypermutationslocality-sensitive hashingnearest neighbor searchsimilarity measures
0
0 comments X

The pith

Ulam similarity on permutations has LSH distortion O(n / sqrt(log n)) with lower bound Omega(n to the 0.12), while Cayley similarity has distortion Theta(n).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper determines the LSH distortion for the Ulam and Cayley similarities defined on permutations of n elements. It establishes that the Ulam similarity admits an LSH scheme with multiplicative distortion O(n over the square root of log n) and proves that no scheme can achieve better than Omega(n to the power 0.12). By contrast, any LSH scheme for the Cayley similarity must incur distortion Theta(n). These results clarify the overhead of using locality-sensitive hashing to accelerate nearest-neighbor search when the underlying measure is one of these two permutation similarities.

Core claim

The paper shows that the Ulam similarity admits a sublinear LSH distortion of O(n / √log n) and proves a matching-style lower bound of Ω(n^{0.12}) on the best achievable distortion. It further shows that the LSH distortion of the Cayley similarity is Θ(n).

What carries the argument

LSH distortion: the smallest multiplicative scaling factor c such that the scaled similarity c·S admits an exact locality-sensitive hash family where the collision probability equals the scaled similarity value.

If this is right

  • An LSH family exists for the Ulam similarity whose collision probabilities match the similarity up to a factor O(n / sqrt(log n)).
  • No LSH family for the Ulam similarity can achieve collision probabilities within a factor o(n^{0.12}) of the true similarity.
  • Any LSH family for the Cayley similarity must distort the similarity by a factor Theta(n).
  • The overhead of LSH-based approximate nearest-neighbor search under Ulam similarity grows sublinearly in n, while under Cayley similarity it grows linearly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The polynomial gap between the Ulam upper and lower bounds indicates that further tightening of the distortion analysis remains possible.
  • For data sets of permutations, Cayley-based search may require index structures other than LSH to avoid linear blowup.
  • Similar distortion calculations could be performed for other common permutation distances such as Kendall tau.

Load-bearing premise

The multiplicative definition of LSH distortion, together with normalizing similarities to the interval [0,1], is assumed to capture the relevant performance limits of hashing for nearest-neighbor tasks.

What would settle it

An explicit LSH family for the Ulam similarity whose collision probabilities stay within a factor o(n / sqrt(log n)) of the true similarity values, or a direct computation for moderate n showing that the minimal required distortion grows faster than the claimed upper bound.

Figures

Figures reproduced from arXiv: 2605.11921 by Erasmo Tani, Flavio Chierichetti, Mirko Giacchini, Ravi Kumar.

Figure 1
Figure 1. Figure 1: The Young diagram for the partition λ = (5, 5, 2, 2, 1). A Young tableau for a partition λ ⊢ n, sometimes called a λ-tableau or a tableau of shape λ, is a Young diagram for λ in which every square has been filled by a distinct number in [n]. 11 8 3 10 14 13 7 2 9 6 4 12 1 5 [PITH_FULL_IMAGE:figures/full_fig_p024_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: A Young tableau for the partition λ = (5, 4, 2, 2, 1). If λ ⊢ n, then Sn has a natural action on the set of λ-tableaux: given a tableau T the tableau π · T is obtained by replacing each number i with π(i). A Young tabloid of shape λ is an equivalence class of λ-tableaux under the equivalence relation in which two tableaux are equivalent if you can obtain one from the other by permuting the content of each … view at source ↗
read the original abstract

Locality-sensitive hashing (LSH) has found widespread use as a fundamental primitive, particularly to accelerate nearest neighbor search. An LSH scheme for a similarity function $S:\mathcal{X} \times \mathcal{X} \to [0,1]$ is a distribution over hash functions on $\mathcal{X}$ with the property that the probability of collision of any two elements $x,y\in \mathcal{X}$ is exactly equal to $S(x,y)$. However, not all similarity functions admit exact LSH schemes. The notion of LSH distortion measures how multiplicatively close a similarity function is to having an LSH scheme. In this work, we study the LSH distortion of the Ulam and Cayley similarities, which are popular similarity measures on permutations of $n$ elements. We show that the Ulam similarity admits a sublinear LSH distortion of $O(n / \sqrt{\log n})$; we also prove a lower bound of $\Omega(n^{0.12})$ on the best LSH distortion achievable. On the other hand, we show that the LSH distortion of the Cayley similarity is $\Theta(n)$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper studies the LSH distortion of Ulam and Cayley similarities on permutations of n elements. It proves that the Ulam similarity admits an LSH distortion of O(n / √log n) via an embedding construction into a space with exact LSH, together with a matching-style lower bound of Ω(n^{0.12}) obtained by a direct-product reduction; separately, it shows that the Cayley similarity has LSH distortion exactly Θ(n) by a variance argument on collision probabilities.

Significance. If the results hold, they give the first non-trivial characterization of LSH approximability for two standard permutation similarities, with the sublinear upper bound on Ulam and the explicit Ω(n^{0.12}) lower bound being technically notable; the Θ(n) result for Cayley is a clean negative statement. The proofs are internally consistent with the multiplicative distortion definition, and the embedding and reduction techniques strengthen the contribution beyond the abstract bounds.

minor comments (2)
  1. [Abstract] Abstract: the statement of the Ulam upper bound would be clearer if it briefly indicated the embedding technique used to obtain the O(n / √log n) factor.
  2. [Section 2] The normalization of similarities to [0,1] and the precise multiplicative distortion definition (referenced in Section 2) should be restated once in the main body with an explicit example for small n to aid readers unfamiliar with the LSH literature.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our work on the LSH distortion of Ulam and Cayley similarities, including recognition of the sublinear upper bound for Ulam, the Ω(n^{0.12}) lower bound, and the tight Θ(n) characterization for Cayley. The recommendation for minor revision is noted.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's upper-bound construction for Ulam similarity (via embedding into a space with exact LSH) and lower-bound reduction (via direct product yielding the n^0.12 exponent), along with the simple variance argument for Cayley similarity's Θ(n) distortion, rely on the multiplicative distortion definition from Section 2 and independent probabilistic arguments. No steps reduce by construction to fitted parameters, self-definitional equivalences, or load-bearing self-citations; the derivation chain is self-contained against the stated definitions and external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claims rest on the standard definition of LSH and the conventional definitions of Ulam and Cayley similarities; no new parameters, axioms beyond standard math, or invented entities are introduced.

axioms (2)
  • standard math LSH scheme requires collision probability exactly equal to similarity S(x,y)
    This is the foundational definition from prior LSH literature invoked in the abstract.
  • domain assumption Ulam and Cayley similarities are normalized to [0,1] on permutations
    Assumed without further justification in the problem setup.

pith-pipeline@v0.9.0 · 5511 in / 1256 out tokens · 51365 ms · 2026-05-13T04:49:43.726773+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

  1. [1]

    Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions

    Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. CACM, 51 0 (1): 0 117--122, 2008

  2. [2]

    The computational hardness of estimating edit distance

    Alexandr Andoni and Robert Krauthgamer. The computational hardness of estimating edit distance. SICOMP, 39 0 (6): 0 2398--2429, 2010

  3. [3]

    Overcoming the l\( _ 1 \) non-embeddability barrier: algorithms for product metrics

    Alexandr Andoni, Piotr Indyk, and Robert Krauthgamer. Overcoming the l\( _ 1 \) non-embeddability barrier: algorithms for product metrics. In SODA, pages 865--874, 2009

  4. [4]

    On the resemblance and containment of documents

    Andrei Z Broder. On the resemblance and containment of documents. In SEQUENCES, pages 21--29, 1997

  5. [5]

    Broder, Moses Charikar, Alan M

    Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise independent permutations. JCSS, 60 0 (3): 0 630--659, 2000

  6. [6]

    Similarity estimation techniques from rounding algorithms

    Moses Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380--388, 2002

  7. [7]

    Embedding the U lam metric into l\( _ 1 \)

    Moses Charikar and Robert Krauthgamer. Embedding the U lam metric into l\( _ 1 \). Theory Comput., 2 0 (11): 0 207--224, 2006

  8. [8]

    Kim, and William Kuszmaul

    Moses Charikar, Ofir Geri, Michael P. Kim, and William Kuszmaul. On estimating edit distance: Alignment, dimension reduction, and embeddings. In ICALP, pages 34:1--34:14, 2018

  9. [9]

    On the distortion of locality sensitive hashing

    Flavio Chierichetti, Ravi Kumar, Alessandro Panconesi, and Erisa Terolli. On the distortion of locality sensitive hashing. SICOMP, 48 0 (2): 0 350--372, 2019

  10. [10]

    Group Representations in Probability and Statistics

    Persi Diaconis. Group Representations in Probability and Statistics. Institute of Mathematical Statistics, 1988

  11. [11]

    Similarity search in high dimensions via hashing

    Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high dimensions via hashing. In VLDB, pages 518--529, 1999

  12. [12]

    Approximate nearest neighbors: towards removing the curse of dimensionality

    Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In STOC, pages 604--613, 1998

  13. [13]

    Piotr Indyk, Rajeev Motwani, Prabhakar Raghavan, and Santosh S. Vempala. Locality-preserving hashing in multidimensional spaces. In STOC, pages 618--625, 1997

  14. [14]

    Group Theoretical Methods in Machine Learning

    Imre Risi Kondor. Group Theoretical Methods in Machine Learning. Columbia University, 2008

  15. [15]

    Characters of symmetric groups: sharp bounds and applications

    Michael Larsen and Aner Shalev. Characters of symmetric groups: sharp bounds and applications. Inventiones mathematicae, 174 0 (3): 0 645--687, 2008

  16. [16]

    Multi-probe LSH: efficient indexing for high-dimensional similarity search

    Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. Multi-probe LSH: efficient indexing for high-dimensional similarity search. In VLDB, pages 950--961, 2007

  17. [17]

    A combinatorial proof of a symmetric group character involution

    Paul Renteln. A combinatorial proof of a symmetric group character involution. The American Mathematical Monthly, 130 0 (9): 0 855--858, 2023

  18. [18]

    Upper bound on the characters of the symmetric groups

    Yuval Roichman. Upper bound on the characters of the symmetric groups. Invent. Math., 125: 0 451--485, 1996

  19. [19]

    The Symmetric Group: Representations, Combinatorial Algorithms, and Symmetric Functions

    Bruce Sagan. The Symmetric Group: Representations, Combinatorial Algorithms, and Symmetric Functions. Springer Science & Business Media, 2001

  20. [20]

    Bayesian locality sensitive hashing for fast similarity search

    Venu Satuluri and Srinivasan Parthasarathy. Bayesian locality sensitive hashing for fast similarity search. Proc. VLDB Endow. , 5 0 (5): 0 430--441, 2012