Revisiting O(n log log n) chaining for anchored edit distance
Pith reviewed 2026-06-28 07:59 UTC · model grok-4.3
The pith
Anchored edit distance chaining runs in O(n log log n) time and O(n) space by merging two 1990s techniques.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By combining the gap-cost computation of Chao and Miller with the overlap-cost computation of Baker and Giancarlo, anchored edit distance under the Delta_diag overlap cost and L_infinity gap cost can be solved in O(n log log n) time using O(n) space.
What carries the argument
Direct integration of the 1995 gap-cost data structures with the 1998 overlap-cost data structures to handle both gap penalties and forbidden overlaps in one pass.
If this is right
- The algorithm uses linear space.
- A simplified O(n log n) implementation called llchain runs on instances with millions of anchors.
- llchain is faster than prior methods on average for 3 million anchors and for HiFi read mapping against the human genome.
Where Pith is reading between the lines
- The approach may extend to other cost models that mix gap and overlap penalties in sequence alignment.
- Faster chaining could reduce the runtime bottleneck in seed-chain-extend pipelines for long-read mapping.
Load-bearing premise
The gap-cost and overlap-cost techniques from the two cited papers combine without introducing extra logarithmic factors or space overhead.
What would settle it
An input of n fragments on which the combined algorithm requires more than O(n log log n) time or more than O(n) space.
read the original abstract
Colinear chaining is a classical heuristic for sequence alignment: it enables scalable genome comparison and is a main component of many state-of-the-art read mappers based on seed-chain-extend. The earliest $O(n \log \log n)$ time algorithms by Eppstein et al. (J. ACM, 1992) chained $n$ fragments between two sequences $T$ and $Q$ while minimizing a gap cost based on the diagonal distance $\Delta_{\text{diag}}$ between consecutive fragments. They also forbid fragment overlaps, which are essential in current chaining formulations: in long-read mapping, overlaps improve sensitivity and avoid restrictions on the fragment class considered. Jain, Gibney, and Thankachan (J. Comput. Biol. 2022) recently combined a $\Delta_{\text{diag}} = |\Delta_T -\Delta_Q|$ overlap cost with the classic $L_\infty = \max(\Delta_T , \Delta_Q)$ gap cost that takes the maximum between the horizontal and vertical gap between the fragments and they proved that chaining under this cost model is equivalent to the anchored edit distance. We improve the existing $O(n \log^3 n)$-time algorithm for anchored edit distance to $O(n \log \log n)$ time in $O(n)$ space, by combining the gap-cost computation of Chao and Miller (Algorithmica, 1995) with the overlap-cost computation of Baker and Giancarlo (ESA, 1998). By developing llchain, a simpler $O(n \log n)$-time implementation of our method, we show how chaining algorithms that might have been recently overlooked by the bioinformatics community scale competitively to millions of fragments and large genomes. On average, llchain is $10\times$ faster than other methods on instances with $3\,000\,000$ anchors, and over $3\times$ faster on MEMs between HiFi reads and a reference human genome.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims an O(n log log n)-time, O(n)-space algorithm for anchored edit distance (colinear chaining under L_infty gap cost and Delta_diag overlap cost) obtained by combining the gap-cost data structure of Chao and Miller (Algorithmica 1995) with the overlap-cost data structure of Baker and Giancarlo (ESA 1998). It also supplies an explicit simpler O(n log n) implementation called llchain, together with an analysis of a merged query procedure running in O(log log n) per fragment and empirical timings showing 3-10x speedups over prior methods on instances with up to 3 million anchors.
Significance. If the stated combination and merged-query analysis hold, the result revives overlooked classical O(n log log n) chaining techniques for a cost model directly relevant to modern seed-chain-extend pipelines in long-read mapping. The manuscript supplies an explicit algorithm description, direct complexity analysis of the simultaneous data-structure maintenance, and reproducible performance data on large genomes; these elements strengthen the contribution beyond the abstract claim.
minor comments (2)
- [Abstract] Abstract: the O(n log n) llchain variant is introduced only in the final sentence; a parenthetical note on its complexity would improve readability.
- Section 4 (or wherever the merged query is defined): the pseudocode or high-level description of how the two structures are maintained simultaneously could include an explicit statement of the per-fragment query time to make the O(log log n) bound immediately visible.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the manuscript, the recognition of its significance for modern seed-chain-extend pipelines, and the recommendation to accept. We are pleased that the explicit algorithm description, complexity analysis, and empirical results were viewed as strengthening the contribution.
Circularity Check
No significant circularity; derivation assembles independent external algorithms
full rationale
The paper's central claim is an algorithmic combination of the Chao-Miller (1995) gap-cost data structure and Baker-Giancarlo (1998) overlap-cost data structure, both from unrelated prior authors. The manuscript provides an explicit merged query procedure and direct O(log log n) per-fragment complexity analysis. No equations reduce to inputs by construction, no self-citations are load-bearing, and no ansatz or uniqueness result is smuggled from the authors' own prior work. The derivation is self-contained against the cited external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Correctness of the gap-cost computation in Chao and Miller (Algorithmica, 1995) and the overlap-cost computation in Baker and Giancarlo (ESA, 1998).
Reference graph
Works this paper leans on
-
[1]
Short paper. URL: https://web.archive.org/web/20250708030638/https://www.iscb.org/cms_ addon/conferences/ismbeccb2004/short%20papers/19.pdf. 2 Mohamed Ibrahim Abouelhoda and Enno Ohlebusch. Chaining algorithms for multiple genome comparison.J. Discrete Algorithms, 3(2-4):321–341, 2005.doi:10.1016/J.JDA.2004.08.011. 3 Brenda S. Baker and Raffaele Giancarlo...
-
[2]
4 Rudolf Bayer and Edward M. McCreight. Organization and maintenance of large ordered indexes.Acta Informatica, 1(3):173–189, 1972.doi:10.1007/bf00288683. 5 Gary Benson, Avivit Levy, S. Maimoni, D. Noifeld, and B. Riva Shalom. LCSk: A refined similarity measure.Theor. Comput. Sci., 638:11–26, 2016.doi:10.1016/J.TCS.2015.11.026. 6 Gary Benson, Avivit Levy,...
-
[3]
doi: 10.1007/978-3-540-77974-2. 12 Arthur L. Delcher, Simon Kasif, Robert D. Fleischmann, Jeremy Peterson, Owen White, and Steven L. Salzberg. Alignment of whole genomes.Nucleic Acids Research, 27(11):2369–2376, January 1999.doi:10.1093/nar/27.11.2369. 13 David Eppstein, Zvi Galil, Raffaele Giancarlo, and Giuseppe F. Italiano. Sparse dynamic programming I...
-
[4]
doi:10.1109/SFCS.1978.3. 17 Dan Gusfield.Algorithms on Strings, Trees, and Sequences - Computer Science and Computa- tional Biology. Cambridge University Press, 1997.doi:10.1017/CBO9780511574931. 18 Yijie Han. Deterministic sorting inO(nlog logn )time and linear space. In John H. Reif, editor,Proceedings on 34th Annual ACM Symposium on Theory of Computing...
-
[5]
20 Chirag Jain, Daniel Gibney, and Sharma V
URL:https://www.bx.psu.edu/~rsharris/rsharris_phd_ thesis_2007.pdf. 20 Chirag Jain, Daniel Gibney, and Sharma V. Thankachan. Algorithms for colinear chaining with overlaps and gap costs.J. Comput. Biol., 29(11):1237–1251, 2022.doi:10.1089/CMB. 2022.0266. 21 Donald B. Johnson. A priority queue in which initialization and queue operations take O(log logD)ti...
work page doi:10.1089/cmb 2022
-
[6]
22 Stefan Kurtz, Adam Phillippy, Arthur L
doi: 10.1007/bf01786986. 22 Stefan Kurtz, Adam Phillippy, Arthur L. Delcher, Michael Smoot, Martin Shumway, Corina Antonescu, and Steven L. Salzberg. Versatile and open software for comparing large genomes. Genome biology, 5(2):R12, 2004.doi:10.1186/gb-2004-5-2-r12. 23 Heng Li. Minimap2: pairwise alignment for nucleotide sequences.Bioinform., 34(18):3094–...
-
[7]
and Mikheenko, Alla and Vollger, Mitchell R
URL:http://dl.acm.org/citation.cfm?id=313651.313661. 29 Sergey Nurk, Sergey Koren, Arang Rhie, Mikko Rautiainen, Andrey V. Bzikadze, Alla Mikheenko, Mitchell R. Vollger, Nicolas Altemose, Lev Uralsky, Ariel Gershman, Sergey Aganezov, Savannah J. Hoyt, Mark Diekhans, Glennis A. Logsdon, Michael Alonge, Stylianos E. Antonarakis, Matthew Borchers, Gerard G. ...
-
[8]
33 Filip Pavetić, Ivan Katanić, Gustav Matula, Goran Žužić, and Mile Šikić
doi: 10.1186/1748-7188-6-4. 33 Filip Pavetić, Ivan Katanić, Gustav Matula, Goran Žužić, and Mile Šikić. Fast and simple algorithms for computing bothLCS k and LCS k+.arXiv,
-
[9]
doi:10.48550/ARXIV.1705. 07279. 34 Filip Pavetić, Goran Žužić, and Mile Šikić.LCSk++: Practical similarity metric for long strings.arXiv, 2014.doi:10.48550/ARXIV.1407.2407. 35 Jingwen Ren and Mark J. P. Chaisson. lra: A long read aligner for sequences and contigs. PLoS Comput. Biol., 17(6), 2021.doi:10.1371/JOURNAL.PCBI.1009078. 36 Arang Rhie, Sergey Nurk...
-
[10]
39 Kristoffer Sahlin, Thomas Baudeau, Bastien Cazaux, and Camille Marchet
arXiv:2506.11750, doi:10.48550/ARXIV.2506.11750. 39 Kristoffer Sahlin, Thomas Baudeau, Bastien Cazaux, and Camille Marchet. A survey of mapping algorithms in the long-reads era.Genome Biology, 24(1):133,
-
[11]
Accurate spliced alignment of long RNA sequencing reads
40 Kristoffer Sahlin and Veli Mäkinen. Accurate spliced alignment of long RNA sequencing reads. Bioinform., 37(24):4643–4651, 2021.doi:10.1093/BIOINFORMATICS/BTAB540. 41 P. van Emde Boas. Preserving order in a forest in less than logarithmic time and linear space. Information Processing Letters, 6(3):80–82, June 1977.doi:10.1016/0020-0190(77)90031-x. 42 W...
-
[12]
44 Dan E. Willard. Log-logarithmic worst-case range queries are possible in spaceθ(n).Informa- tion Processing Letters, 17(2):81–84, August 1983.doi:10.1016/0020-0190(83)90075-3. 20 RevisitingO(nlog logn)Chaining for Anchored Edit Distance 45 Zheng Zhang, Balaji Raghavachari, Ross C. Hardison, and Webb Miller. Chaining multiple- alignment blocks.J. Comput...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.