Recognition: unknown
A weighted angle distance on strings
Pith reviewed 2026-05-09 22:31 UTC · model grok-4.3
The pith
A new metric on strings sums angle distances between n-gram count vectors at every scale with exponential weights.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The author defines d_ρ as the sum over n of ρ^n times the angle between the two strings' n-gram count vectors. This quantity is shown to be a metric, to be robust under tandem-repeat insertions, to admit an exact linear-time algorithm via suffix trees, and to support effective clustering. The isometries of the resulting metric space are also characterized.
What carries the argument
d_ρ, the infinite sum of exponentially weighted angle distances between n-gram count vectors of two strings.
If this is right
- d_ρ produces DBSCAN clusters that match or exceed the quality obtained from edit distance and from single-scale n-gram distances.
- Exact values of d_ρ can be obtained in time linear in string length by building a generalized suffix tree.
- The distance is provably insensitive to the insertion of tandem repeats in a controlled way.
- The maps that preserve d_ρ distances between strings are fully identified.
Where Pith is reading between the lines
- Because the construction is scale-agnostic by design, it could be applied directly to biological sequences where motif length is unknown in advance.
- Varying ρ near 1 would emphasize longer n-grams without requiring a separate family of distances.
- The isometry characterization may reveal which string operations leave the multi-scale frequency structure unchanged.
Load-bearing premise
Exponentially weighting and summing angle distances between n-gram count vectors always yields a quantity that satisfies the triangle inequality for all strings and all values of ρ.
What would settle it
Any triple of strings A, B, C where the computed d_ρ(A,C) exceeds d_ρ(A,B) + d_ρ(B,C) for some ρ in (0,1), or a benchmark set on which d_ρ clustering quality falls clearly below both edit distance and fixed-n n-gram baselines.
Figures
read the original abstract
We define a multi-scale metric $d_\rho$ on strings by aggregating angle distances between all $n$-gram count vectors with exponential weights $\rho^n$. We benchmark $d_\rho$ in DBSCAN clustering against edit and $n$-gram baselines, give a linear-time suffix-tree algorithm for evaluation, prove metric and stability properties (including robustness under tandem-repeat stutters), and characterize isometries.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper defines a multi-scale metric d_ρ on strings by aggregating angle distances θ between all n-gram count vectors v_n(s) and v_n(t) with exponential weights ρ^n, i.e., d_ρ(s,t) = ∑_n ρ^n ⋅ θ(v_n(s), v_n(t)). It provides a linear-time suffix-tree algorithm for evaluation, proves that d_ρ satisfies the metric axioms and stability properties (including robustness to tandem-repeat stutters), benchmarks d_ρ in DBSCAN clustering against edit-distance and plain n-gram baselines, and characterizes the isometries of the resulting space.
Significance. If the definition is completed to be fully rigorous, the construction supplies a computationally efficient (linear-time) multi-scale string metric with explicit stability guarantees under biologically relevant perturbations such as tandem repeats. The combination of a suffix-tree algorithm, metric proofs, and empirical clustering comparisons is a concrete strength that could make the distance useful for applications in bioinformatics or text mining where scale-dependent n-gram overlap matters.
major comments (2)
- [Definition of d_ρ (abstract and §2)] Definition of d_ρ (abstract and §2): the angle θ(u,v) = arccos(⟨u,v⟩ / (|u||v|)) is undefined (0/0) whenever both n-gram vectors are the zero vector, which occurs for all n > max(len(s),len(t)). The manuscript must explicitly stipulate a convention (e.g., θ(0,0) := 0 or truncation of the sum at n ≤ min(len(s),len(t))) before the claimed proofs of the metric axioms—especially d(s,t)=0 ⇔ s=t and the triangle inequality—can be considered complete. The stability results under tandem repeats likewise rest on this well-definedness.
- [Metric proof (§3)] Metric proof (presumably §3): even after adopting a convention for θ(0,0), the manuscript should verify that the resulting weighted sum remains a metric for every ρ in the stated range and is independent of the particular convention chosen. The current claim that the construction is “parameter-free” once ρ is fixed appears to require an explicit check that different conventions for zero vectors do not alter the triangle inequality or the zero-distance characterization.
minor comments (2)
- [Benchmarks] Benchmarks section: state the precise range of ρ values tested and whether the DBSCAN performance is sensitive to ρ; include a short ablation showing how the linear-time algorithm scales with string length on the reported data sets.
- [Notation] Notation: clarify whether v_n(s) denotes raw counts or normalized frequency vectors, and whether the inner product is the standard dot product on ℝ^{|Σ|^n}.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments, which help strengthen the rigor of the definition and proofs. We address each major comment below and will incorporate the necessary clarifications and verifications in the revised manuscript.
read point-by-point responses
-
Referee: [Definition of d_ρ (abstract and §2)] Definition of d_ρ (abstract and §2): the angle θ(u,v) = arccos(⟨u,v⟩ / (|u||v|)) is undefined (0/0) whenever both n-gram vectors are the zero vector, which occurs for all n > max(len(s),len(t)). The manuscript must explicitly stipulate a convention (e.g., θ(0,0) := 0 or truncation of the sum at n ≤ min(len(s),len(t))) before the claimed proofs of the metric axioms—especially d(s,t)=0 ⇔ s=t and the triangle inequality—can be considered complete. The stability results under tandem repeats likewise rest on this well-definedness.
Authors: We agree that an explicit convention is required for well-definedness. In the revised manuscript we will stipulate θ(0,0) := 0 and note that the infinite sum may equivalently be truncated at n = max(len(s), len(t)) + 1, since all subsequent terms vanish. With this convention the expression for d_ρ is unambiguously defined for every pair of strings. We will then re-verify the metric axioms (non-negativity, symmetry, d(s,t)=0 ⇔ s=t, and the triangle inequality) and the tandem-repeat stability statements under the adopted convention, adding the necessary short arguments in §2 and §3. revision: yes
-
Referee: [Metric proof (§3)] Metric proof (presumably §3): even after adopting a convention for θ(0,0), the manuscript should verify that the resulting weighted sum remains a metric for every ρ in the stated range and is independent of the particular convention chosen. The current claim that the construction is “parameter-free” once ρ is fixed appears to require an explicit check that different conventions for zero vectors do not alter the triangle inequality or the zero-distance characterization.
Authors: We will add an explicit verification that, once θ(0,0) := 0 is fixed, d_ρ satisfies the metric axioms for all ρ ∈ (0,1). Because the contribution of every term with n > max(len(s),len(t)) is identically zero under this convention, any alternative convention that also assigns a finite value to θ(0,0) yields the identical numerical value of d_ρ; consequently the triangle inequality and the characterization d_ρ(s,t)=0 ⇔ s=t are unaffected by the choice. The proofs in §3 will be expanded with a short paragraph establishing this independence and confirming that the weighted sum of angle distances remains a metric on the stated parameter range. The “parameter-free” phrasing will be clarified to mean that, for any fixed ρ, the distance is completely determined once the convention is stated. revision: yes
Circularity Check
No circularity: definition and proofs are self-contained from standard components
full rationale
The paper constructs d_ρ explicitly as a weighted sum of angle distances θ on n-gram count vectors, using the standard arccos inner-product formula. Metric axioms and stability properties (including tandem-repeat robustness) are stated to be proved directly from this definition and the properties of cosine angle, without any reduction to fitted parameters, self-citations that bear the load, or renaming of prior results. No equation is shown to equal its own input by construction, and the derivation relies on independent verification of the triangle inequality and isometry characterization rather than circular self-reference. The zero-vector convention issue raised externally is a potential definitional clarification but does not create a circularity in the claimed chain.
Axiom & Free-Parameter Ledger
free parameters (1)
- ρ
Reference graph
Works this paper leans on
-
[1]
Cavnar and John M
William B. Cavnar and John M. Trenkle,N-gram-based text categorization, Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, NV), 1994
1994
-
[2]
Rudi Cilibrasi and Paul M. B. Vit´ anyi,Clustering by compression, IEEE Transactions on Information Theory51(2005), no. 4, 1523–1545
2005
-
[3]
Damerau,A technique for computer detection and correction of spelling errors, Communications of the ACM7(1964), no
Fred J. Damerau,A technique for computer detection and correction of spelling errors, Communications of the ACM7(1964), no. 3, 171–176
1964
-
[4]
Hamming,Error detecting and error correcting codes, The Bell System Technical Journal 29(1950), no
Richard W. Hamming,Error detecting and error correcting codes, The Bell System Technical Journal 29(1950), no. 2, 147–160
1950
-
[5]
7, 2002, pp
Christina Leslie, Eleazar Eskin, and William Stafford Noble,The spectrum kernel: A string kernel for SVM protein classification, Pacific Symposium on Biocomputing, vol. 7, 2002, pp. 564–575
2002
-
[6]
4, 467–476
Christina Leslie, Eleazar Eskin, Jason Weston, and William Stafford Noble,Mismatch string kernels for discriminative protein classification, Bioinformatics20(2004), no. 4, 467–476
2004
-
[7]
Levenshtein,Binary codes capable of correcting deletions, insertions, and reversals, Doklady Akademii Nauk SSSR163(1965), no
Vladimir I. Levenshtein,Binary codes capable of correcting deletions, insertions, and reversals, Doklady Akademii Nauk SSSR163(1965), no. 4, 845–848, In Russian
1965
-
[8]
Douglas Lind and Brian Marcus,An introduction to symbolic dynamics and coding, 2 ed., Cambridge University Press, 2021
2021
-
[9]
Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins,Text classifi- cation using string kernels, Journal of Machine Learning Research2(2002), 419–444. 24 GRANT MOLNAR Dataset Family Median pairwise time splice edit2.5s splice kgram angle1.2m splice js kgram3.2m splice weighted angle19.3m strseq edit2.1s strseq kgram angle39....
2002
-
[10]
Manning, Prabhakar Raghavan, and Hinrich Sch¨ utze,Introduction to information re- trieval, Cambridge University Press, 2008
Christopher D. Manning, Prabhakar Raghavan, and Hinrich Sch¨ utze,Introduction to information re- trieval, Cambridge University Press, 2008
2008
-
[11]
22, 2026
Grant Molnar,weighted-angle-distance: reference implementation and experimental pipeline, https://github.com/grantmolnar/weighted-angle-distance, 2026, Accessed Apr. 22, 2026. A WEIGHTED ANGLE DISTANCE ON STRINGS 25 Figure 2.Best NMI achieved by each distance family onsplice
2026
-
[12]
Ondov, Todd J
Brian D. Ondov, Todd J. Treangen, P´ all Melsted, Adam B. Mallonee, Nicholas H. Bergman, Sergey Ko- ren, and Adam M. Phillippy,Mash: Fast genome and metagenome distance estimation using MinHash, Genome Biology17(2016), no. 1, 132
2016
-
[13]
Otu and Khalid Sayood,A new sequence distance measure for phylogenetic tree construction, Bioinformatics19(2003), no
Hasan H. Otu and Khalid Sayood,A new sequence distance measure for phylogenetic tree construction, Bioinformatics19(2003), no. 16, 2122–2130
2003
-
[14]
17, 2025
Franco Pappalardo,strsimpy: String similarity library (includes cosine onn-gram count vectors), https://pypi.org/project/strsimpy/, 2024, Accessed Nov. 17, 2025
2024
-
[15]
Huszar, and Hari Iyer,A new string edit distance and applications, Algorithms15(2022), no
Taylor Petty, Jan Hannig, Tunde I. Huszar, and Hari Iyer,A new string edit distance and applications, Algorithms15(2022), no. 7, 242
2022
-
[16]
11, 613–620
G´ erard Salton, Anita Wong, and Chung-Shu Yang,A vector space model for automatic indexing, Com- munications of the ACM18(1975), no. 11, 613–620
1975
-
[17]
Shields,The ergodic theory of discrete sample paths, Graduate Studies in Mathematics, vol
Paul C. Shields,The ergodic theory of discrete sample paths, Graduate Studies in Mathematics, vol. 13, American Mathematical Society, 1996
1996
-
[18]
Suppl 10, S7
S¨ oren Sonnenburg, Gabriele Schweikert, Philipp Philips, Johannes Behr, and Gunnar R¨ atsch,Accurate splice site prediction using support vector machines, BMC Bioinformatics8(2007), no. Suppl 10, S7
2007
-
[19]
1, 191–211
Esko Ukkonen,Approximate string-matching with q-grams and maximal matches, Theoretical Computer Science92(1992), no. 1, 191–211
1992
-
[20]
Wagner and Michael J
Robert A. Wagner and Michael J. Fischer,The string-to-string correction problem, Journal of the ACM 21(1974), no. 1, 168–173
1974
-
[21]
Girgis, Guillaume Bernard, Chris-Andre Leimeister, Kujin Tang, Thomas Dencker, Anna Katharina Lau, Sophie R¨ ohling, Jae Jin Choi, Michael S
Andrzej Zielezinski, Hani Z. Girgis, Guillaume Bernard, Chris-Andre Leimeister, Kujin Tang, Thomas Dencker, Anna Katharina Lau, Sophie R¨ ohling, Jae Jin Choi, Michael S. Waterman, Matteo Comin, Sung-Hou Kim, Susana Vinga, Jonas S. Almeida, Cheong Xin Chan, Benjamin T. James, Fengzhu Sun, Burkhard Morgenstern, and Wojciech M. Karlowski,Benchmarking of ali...
2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.