Revisiting $O(n \log \log n)$ chaining for anchored edit distance

Nicola Rizzo; Ragnar Groot Koerkamp

arxiv: 2606.03929 · v1 · pith:GEUFLFO4new · submitted 2026-06-02 · 💻 cs.DS

Revisiting O(n log log n) chaining for anchored edit distance

Nicola Rizzo , Ragnar Groot Koerkamp This is my paper

Pith reviewed 2026-06-28 07:59 UTC · model grok-4.3

classification 💻 cs.DS

keywords colinear chaininganchored edit distancesequence alignmentedit distancegenome mappingread mappers

0 comments

The pith

Anchored edit distance chaining runs in O(n log log n) time and O(n) space by merging two 1990s techniques.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Colinear chaining finds the lowest-cost way to connect sequence fragments while accounting for gaps and overlaps. The paper proves that the anchored edit distance version of this problem, which uses a specific overlap cost combined with an L-infinity gap cost, can be solved in O(n log log n) time. This improves the prior O(n log cubed n) bound. The improvement comes from directly combining an existing method for gap costs with an existing method for overlap costs. The result matters for genome alignment because chaining is a core step in modern read mappers that handle millions of fragments.

Core claim

By combining the gap-cost computation of Chao and Miller with the overlap-cost computation of Baker and Giancarlo, anchored edit distance under the Delta_diag overlap cost and L_infinity gap cost can be solved in O(n log log n) time using O(n) space.

What carries the argument

Direct integration of the 1995 gap-cost data structures with the 1998 overlap-cost data structures to handle both gap penalties and forbidden overlaps in one pass.

If this is right

The algorithm uses linear space.
A simplified O(n log n) implementation called llchain runs on instances with millions of anchors.
llchain is faster than prior methods on average for 3 million anchors and for HiFi read mapping against the human genome.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may extend to other cost models that mix gap and overlap penalties in sequence alignment.
Faster chaining could reduce the runtime bottleneck in seed-chain-extend pipelines for long-read mapping.

Load-bearing premise

The gap-cost and overlap-cost techniques from the two cited papers combine without introducing extra logarithmic factors or space overhead.

What would settle it

An input of n fragments on which the combined algorithm requires more than O(n log log n) time or more than O(n) space.

read the original abstract

Colinear chaining is a classical heuristic for sequence alignment: it enables scalable genome comparison and is a main component of many state-of-the-art read mappers based on seed-chain-extend. The earliest $O(n \log \log n)$ time algorithms by Eppstein et al. (J. ACM, 1992) chained $n$ fragments between two sequences $T$ and $Q$ while minimizing a gap cost based on the diagonal distance $\Delta_{\text{diag}}$ between consecutive fragments. They also forbid fragment overlaps, which are essential in current chaining formulations: in long-read mapping, overlaps improve sensitivity and avoid restrictions on the fragment class considered. Jain, Gibney, and Thankachan (J. Comput. Biol. 2022) recently combined a $\Delta_{\text{diag}} = |\Delta_T -\Delta_Q|$ overlap cost with the classic $L_\infty = \max(\Delta_T , \Delta_Q)$ gap cost that takes the maximum between the horizontal and vertical gap between the fragments and they proved that chaining under this cost model is equivalent to the anchored edit distance. We improve the existing $O(n \log^3 n)$-time algorithm for anchored edit distance to $O(n \log \log n)$ time in $O(n)$ space, by combining the gap-cost computation of Chao and Miller (Algorithmica, 1995) with the overlap-cost computation of Baker and Giancarlo (ESA, 1998). By developing llchain, a simpler $O(n \log n)$-time implementation of our method, we show how chaining algorithms that might have been recently overlooked by the bioinformatics community scale competitively to millions of fragments and large genomes. On average, llchain is $10\times$ faster than other methods on instances with $3\,000\,000$ anchors, and over $3\times$ faster on MEMs between HiFi reads and a reference human genome.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

They combine the Chao-Miller gap structure with Baker-Giancarlo overlap handling to reach the O(n log log n) bound for anchored edit distance, plus a simpler O(n log n) variant that runs fast on large instances.

read the letter

The main point is that this paper delivers the O(n log log n) time bound for anchored edit distance under the L_infty gap plus overlap cost model. They do it by running the 1995 Chao-Miller structure and the 1998 Baker-Giancarlo structure together on the same fragments, with a merged query that keeps the per-fragment cost at O(log log n) and total space at O(n).

The full manuscript spells out the simultaneous maintenance and the query procedure, so the combination works without extra log factors. That is the actual advance: showing the two old pieces fit cleanly for this exact cost model. They also give llchain, a stripped-down O(n log n) version that is easier to code and still beats current chaining tools by 10x on three million anchors and over 3x on HiFi reads against the human genome.

The work is mostly reuse of existing data structures rather than new primitives, which limits how broad the novelty is. The experiments are on realistic sizes but cover only the chaining step, not the full mapper pipeline. Space is claimed O(n), though constant factors would need watching on the largest genomes.

People building or tuning seed-chain-extend aligners will get direct value from the bounds and the implementation. The paper is coherent on its own terms and the central claim holds up once the full algorithm is examined.

Send it to peer review. The combination is now explicit and the practical numbers are strong enough to matter.

Referee Report

0 major / 2 minor

Summary. The manuscript claims an O(n log log n)-time, O(n)-space algorithm for anchored edit distance (colinear chaining under L_infty gap cost and Delta_diag overlap cost) obtained by combining the gap-cost data structure of Chao and Miller (Algorithmica 1995) with the overlap-cost data structure of Baker and Giancarlo (ESA 1998). It also supplies an explicit simpler O(n log n) implementation called llchain, together with an analysis of a merged query procedure running in O(log log n) per fragment and empirical timings showing 3-10x speedups over prior methods on instances with up to 3 million anchors.

Significance. If the stated combination and merged-query analysis hold, the result revives overlooked classical O(n log log n) chaining techniques for a cost model directly relevant to modern seed-chain-extend pipelines in long-read mapping. The manuscript supplies an explicit algorithm description, direct complexity analysis of the simultaneous data-structure maintenance, and reproducible performance data on large genomes; these elements strengthen the contribution beyond the abstract claim.

minor comments (2)

[Abstract] Abstract: the O(n log n) llchain variant is introduced only in the final sentence; a parenthetical note on its complexity would improve readability.
Section 4 (or wherever the merged query is defined): the pseudocode or high-level description of how the two structures are maintained simultaneously could include an explicit statement of the per-fragment query time to make the O(log log n) bound immediately visible.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript, the recognition of its significance for modern seed-chain-extend pipelines, and the recommendation to accept. We are pleased that the explicit algorithm description, complexity analysis, and empirical results were viewed as strengthening the contribution.

Circularity Check

0 steps flagged

No significant circularity; derivation assembles independent external algorithms

full rationale

The paper's central claim is an algorithmic combination of the Chao-Miller (1995) gap-cost data structure and Baker-Giancarlo (1998) overlap-cost data structure, both from unrelated prior authors. The manuscript provides an explicit merged query procedure and direct O(log log n) per-fragment complexity analysis. No equations reduce to inputs by construction, no self-citations are load-bearing, and no ansatz or uniqueness result is smuggled from the authors' own prior work. The derivation is self-contained against the cited external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the correctness of two cited prior algorithms and standard assumptions about range-query data structures; no new free parameters or invented entities are introduced.

axioms (1)

standard math Correctness of the gap-cost computation in Chao and Miller (Algorithmica, 1995) and the overlap-cost computation in Baker and Giancarlo (ESA, 1998).
The new algorithm is defined as their direct combination.

pith-pipeline@v0.9.1-grok · 5886 in / 1292 out tokens · 23531 ms · 2026-06-28T07:59:50.092712+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

[1]

URL: https://web.archive.org/web/20250708030638/https://www.iscb.org/cms_ addon/conferences/ismbeccb2004/short%20papers/19.pdf

Short paper. URL: https://web.archive.org/web/20250708030638/https://www.iscb.org/cms_ addon/conferences/ismbeccb2004/short%20papers/19.pdf. 2 Mohamed Ibrahim Abouelhoda and Enno Ohlebusch. Chaining algorithms for multiple genome comparison.J. Discrete Algorithms, 3(2-4):321–341, 2005.doi:10.1016/J.JDA.2004.08.011. 3 Brenda S. Baker and Raffaele Giancarlo...

work page doi:10.1016/j.jda.2004.08.011 2005
[2]

McCreight

4 Rudolf Bayer and Edward M. McCreight. Organization and maintenance of large ordered indexes.Acta Informatica, 1(3):173–189, 1972.doi:10.1007/bf00288683. 5 Gary Benson, Avivit Levy, S. Maimoni, D. Noifeld, and B. Riva Shalom. LCSk: A refined similarity measure.Theor. Comput. Sci., 638:11–26, 2016.doi:10.1016/J.TCS.2015.11.026. 6 Gary Benson, Avivit Levy,...

work page doi:10.1007/bf00288683 1972
[3]

12 Arthur L

doi: 10.1007/978-3-540-77974-2. 12 Arthur L. Delcher, Simon Kasif, Robert D. Fleischmann, Jeremy Peterson, Owen White, and Steven L. Salzberg. Alignment of whole genomes.Nucleic Acids Research, 27(11):2369–2376, January 1999.doi:10.1093/nar/27.11.2369. 13 David Eppstein, Zvi Galil, Raffaele Giancarlo, and Giuseppe F. Italiano. Sparse dynamic programming I...

work page doi:10.1007/978-3-540-77974-2 1999
[4]

Guibas and Robert Sedgewick

doi:10.1109/SFCS.1978.3. 17 Dan Gusfield.Algorithms on Strings, Trees, and Sequences - Computer Science and Computa- tional Biology. Cambridge University Press, 1997.doi:10.1017/CBO9780511574931. 18 Yijie Han. Deterministic sorting inO(nlog logn )time and linear space. In John H. Reif, editor,Proceedings on 34th Annual ACM Symposium on Theory of Computing...

work page doi:10.1109/sfcs.1978.3 1978
[5]

20 Chirag Jain, Daniel Gibney, and Sharma V

URL:https://www.bx.psu.edu/~rsharris/rsharris_phd_ thesis_2007.pdf. 20 Chirag Jain, Daniel Gibney, and Sharma V. Thankachan. Algorithms for colinear chaining with overlaps and gap costs.J. Comput. Biol., 29(11):1237–1251, 2022.doi:10.1089/CMB. 2022.0266. 21 Donald B. Johnson. A priority queue in which initialization and queue operations take O(log logD)ti...

work page doi:10.1089/cmb 2022
[6]

22 Stefan Kurtz, Adam Phillippy, Arthur L

doi: 10.1007/bf01786986. 22 Stefan Kurtz, Adam Phillippy, Arthur L. Delcher, Michael Smoot, Martin Shumway, Corina Antonescu, and Steven L. Salzberg. Versatile and open software for comparing large genomes. Genome biology, 5(2):R12, 2004.doi:10.1186/gb-2004-5-2-r12. 23 Heng Li. Minimap2: pairwise alignment for nucleotide sequences.Bioinform., 34(18):3094–...

work page doi:10.1007/bf01786986 2004
[7]

and Mikheenko, Alla and Vollger, Mitchell R

URL:http://dl.acm.org/citation.cfm?id=313651.313661. 29 Sergey Nurk, Sergey Koren, Arang Rhie, Mikko Rautiainen, Andrey V. Bzikadze, Alla Mikheenko, Mitchell R. Vollger, Nicolas Altemose, Lev Uralsky, Ariel Gershman, Sergey Aganezov, Savannah J. Hoyt, Mark Diekhans, Glennis A. Logsdon, Michael Alonge, Stylianos E. Antonarakis, Matthew Borchers, Gerard G. ...

work page doi:10.1126/science.abj6987 2022
[8]

33 Filip Pavetić, Ivan Katanić, Gustav Matula, Goran Žužić, and Mile Šikić

doi: 10.1186/1748-7188-6-4. 33 Filip Pavetić, Ivan Katanić, Gustav Matula, Goran Žužić, and Mile Šikić. Fast and simple algorithms for computing bothLCS k and LCS k+.arXiv,

work page doi:10.1186/1748-7188-6-4
[9]

doi:10.48550/ARXIV.1705. 07279. 34 Filip Pavetić, Goran Žužić, and Mile Šikić.LCSk++: Practical similarity metric for long strings.arXiv, 2014.doi:10.48550/ARXIV.1407.2407. 35 Jingwen Ren and Mark J. P. Chaisson. lra: A long read aligner for sequences and contigs. PLoS Comput. Biol., 17(6), 2021.doi:10.1371/JOURNAL.PCBI.1009078. 36 Arang Rhie, Sergey Nurk...

work page doi:10.48550/arxiv.1705 2014
[10]

39 Kristoffer Sahlin, Thomas Baudeau, Bastien Cazaux, and Camille Marchet

arXiv:2506.11750, doi:10.48550/ARXIV.2506.11750. 39 Kristoffer Sahlin, Thomas Baudeau, Bastien Cazaux, and Camille Marchet. A survey of mapping algorithms in the long-reads era.Genome Biology, 24(1):133,

work page doi:10.48550/arxiv.2506.11750
[11]

Accurate spliced alignment of long RNA sequencing reads

40 Kristoffer Sahlin and Veli Mäkinen. Accurate spliced alignment of long RNA sequencing reads. Bioinform., 37(24):4643–4651, 2021.doi:10.1093/BIOINFORMATICS/BTAB540. 41 P. van Emde Boas. Preserving order in a forest in less than logarithmic time and linear space. Information Processing Letters, 6(3):80–82, June 1977.doi:10.1016/0020-0190(77)90031-x. 42 W...

work page doi:10.1093/bioinformatics/btab540 2021
[12]

44 Dan E. Willard. Log-logarithmic worst-case range queries are possible in spaceθ(n).Informa- tion Processing Letters, 17(2):81–84, August 1983.doi:10.1016/0020-0190(83)90075-3. 20 RevisitingO(nlog logn)Chaining for Anchored Edit Distance 45 Zheng Zhang, Balaji Raghavachari, Ross C. Hardison, and Webb Miller. Chaining multiple- alignment blocks.J. Comput...

work page doi:10.1016/0020-0190(83)90075-3 1983

[1] [1]

URL: https://web.archive.org/web/20250708030638/https://www.iscb.org/cms_ addon/conferences/ismbeccb2004/short%20papers/19.pdf

Short paper. URL: https://web.archive.org/web/20250708030638/https://www.iscb.org/cms_ addon/conferences/ismbeccb2004/short%20papers/19.pdf. 2 Mohamed Ibrahim Abouelhoda and Enno Ohlebusch. Chaining algorithms for multiple genome comparison.J. Discrete Algorithms, 3(2-4):321–341, 2005.doi:10.1016/J.JDA.2004.08.011. 3 Brenda S. Baker and Raffaele Giancarlo...

work page doi:10.1016/j.jda.2004.08.011 2005

[2] [2]

McCreight

4 Rudolf Bayer and Edward M. McCreight. Organization and maintenance of large ordered indexes.Acta Informatica, 1(3):173–189, 1972.doi:10.1007/bf00288683. 5 Gary Benson, Avivit Levy, S. Maimoni, D. Noifeld, and B. Riva Shalom. LCSk: A refined similarity measure.Theor. Comput. Sci., 638:11–26, 2016.doi:10.1016/J.TCS.2015.11.026. 6 Gary Benson, Avivit Levy,...

work page doi:10.1007/bf00288683 1972

[3] [3]

12 Arthur L

doi: 10.1007/978-3-540-77974-2. 12 Arthur L. Delcher, Simon Kasif, Robert D. Fleischmann, Jeremy Peterson, Owen White, and Steven L. Salzberg. Alignment of whole genomes.Nucleic Acids Research, 27(11):2369–2376, January 1999.doi:10.1093/nar/27.11.2369. 13 David Eppstein, Zvi Galil, Raffaele Giancarlo, and Giuseppe F. Italiano. Sparse dynamic programming I...

work page doi:10.1007/978-3-540-77974-2 1999

[4] [4]

Guibas and Robert Sedgewick

doi:10.1109/SFCS.1978.3. 17 Dan Gusfield.Algorithms on Strings, Trees, and Sequences - Computer Science and Computa- tional Biology. Cambridge University Press, 1997.doi:10.1017/CBO9780511574931. 18 Yijie Han. Deterministic sorting inO(nlog logn )time and linear space. In John H. Reif, editor,Proceedings on 34th Annual ACM Symposium on Theory of Computing...

work page doi:10.1109/sfcs.1978.3 1978

[5] [5]

20 Chirag Jain, Daniel Gibney, and Sharma V

URL:https://www.bx.psu.edu/~rsharris/rsharris_phd_ thesis_2007.pdf. 20 Chirag Jain, Daniel Gibney, and Sharma V. Thankachan. Algorithms for colinear chaining with overlaps and gap costs.J. Comput. Biol., 29(11):1237–1251, 2022.doi:10.1089/CMB. 2022.0266. 21 Donald B. Johnson. A priority queue in which initialization and queue operations take O(log logD)ti...

work page doi:10.1089/cmb 2022

[6] [6]

22 Stefan Kurtz, Adam Phillippy, Arthur L

doi: 10.1007/bf01786986. 22 Stefan Kurtz, Adam Phillippy, Arthur L. Delcher, Michael Smoot, Martin Shumway, Corina Antonescu, and Steven L. Salzberg. Versatile and open software for comparing large genomes. Genome biology, 5(2):R12, 2004.doi:10.1186/gb-2004-5-2-r12. 23 Heng Li. Minimap2: pairwise alignment for nucleotide sequences.Bioinform., 34(18):3094–...

work page doi:10.1007/bf01786986 2004

[7] [7]

and Mikheenko, Alla and Vollger, Mitchell R

URL:http://dl.acm.org/citation.cfm?id=313651.313661. 29 Sergey Nurk, Sergey Koren, Arang Rhie, Mikko Rautiainen, Andrey V. Bzikadze, Alla Mikheenko, Mitchell R. Vollger, Nicolas Altemose, Lev Uralsky, Ariel Gershman, Sergey Aganezov, Savannah J. Hoyt, Mark Diekhans, Glennis A. Logsdon, Michael Alonge, Stylianos E. Antonarakis, Matthew Borchers, Gerard G. ...

work page doi:10.1126/science.abj6987 2022

[8] [8]

33 Filip Pavetić, Ivan Katanić, Gustav Matula, Goran Žužić, and Mile Šikić

doi: 10.1186/1748-7188-6-4. 33 Filip Pavetić, Ivan Katanić, Gustav Matula, Goran Žužić, and Mile Šikić. Fast and simple algorithms for computing bothLCS k and LCS k+.arXiv,

work page doi:10.1186/1748-7188-6-4

[9] [9]

doi:10.48550/ARXIV.1705. 07279. 34 Filip Pavetić, Goran Žužić, and Mile Šikić.LCSk++: Practical similarity metric for long strings.arXiv, 2014.doi:10.48550/ARXIV.1407.2407. 35 Jingwen Ren and Mark J. P. Chaisson. lra: A long read aligner for sequences and contigs. PLoS Comput. Biol., 17(6), 2021.doi:10.1371/JOURNAL.PCBI.1009078. 36 Arang Rhie, Sergey Nurk...

work page doi:10.48550/arxiv.1705 2014

[10] [10]

39 Kristoffer Sahlin, Thomas Baudeau, Bastien Cazaux, and Camille Marchet

arXiv:2506.11750, doi:10.48550/ARXIV.2506.11750. 39 Kristoffer Sahlin, Thomas Baudeau, Bastien Cazaux, and Camille Marchet. A survey of mapping algorithms in the long-reads era.Genome Biology, 24(1):133,

work page doi:10.48550/arxiv.2506.11750

[11] [11]

Accurate spliced alignment of long RNA sequencing reads

40 Kristoffer Sahlin and Veli Mäkinen. Accurate spliced alignment of long RNA sequencing reads. Bioinform., 37(24):4643–4651, 2021.doi:10.1093/BIOINFORMATICS/BTAB540. 41 P. van Emde Boas. Preserving order in a forest in less than logarithmic time and linear space. Information Processing Letters, 6(3):80–82, June 1977.doi:10.1016/0020-0190(77)90031-x. 42 W...

work page doi:10.1093/bioinformatics/btab540 2021

[12] [12]

44 Dan E. Willard. Log-logarithmic worst-case range queries are possible in spaceθ(n).Informa- tion Processing Letters, 17(2):81–84, August 1983.doi:10.1016/0020-0190(83)90075-3. 20 RevisitingO(nlog logn)Chaining for Anchored Edit Distance 45 Zheng Zhang, Balaji Raghavachari, Ross C. Hardison, and Webb Miller. Chaining multiple- alignment blocks.J. Comput...

work page doi:10.1016/0020-0190(83)90075-3 1983