pith. sign in

arxiv: 2606.03929 · v1 · pith:GEUFLFO4new · submitted 2026-06-02 · 💻 cs.DS

Revisiting O(n log log n) chaining for anchored edit distance

Pith reviewed 2026-06-28 07:59 UTC · model grok-4.3

classification 💻 cs.DS
keywords colinear chaininganchored edit distancesequence alignmentedit distancegenome mappingread mappers
0
0 comments X

The pith

Anchored edit distance chaining runs in O(n log log n) time and O(n) space by merging two 1990s techniques.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Colinear chaining finds the lowest-cost way to connect sequence fragments while accounting for gaps and overlaps. The paper proves that the anchored edit distance version of this problem, which uses a specific overlap cost combined with an L-infinity gap cost, can be solved in O(n log log n) time. This improves the prior O(n log cubed n) bound. The improvement comes from directly combining an existing method for gap costs with an existing method for overlap costs. The result matters for genome alignment because chaining is a core step in modern read mappers that handle millions of fragments.

Core claim

By combining the gap-cost computation of Chao and Miller with the overlap-cost computation of Baker and Giancarlo, anchored edit distance under the Delta_diag overlap cost and L_infinity gap cost can be solved in O(n log log n) time using O(n) space.

What carries the argument

Direct integration of the 1995 gap-cost data structures with the 1998 overlap-cost data structures to handle both gap penalties and forbidden overlaps in one pass.

If this is right

  • The algorithm uses linear space.
  • A simplified O(n log n) implementation called llchain runs on instances with millions of anchors.
  • llchain is faster than prior methods on average for 3 million anchors and for HiFi read mapping against the human genome.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may extend to other cost models that mix gap and overlap penalties in sequence alignment.
  • Faster chaining could reduce the runtime bottleneck in seed-chain-extend pipelines for long-read mapping.

Load-bearing premise

The gap-cost and overlap-cost techniques from the two cited papers combine without introducing extra logarithmic factors or space overhead.

What would settle it

An input of n fragments on which the combined algorithm requires more than O(n log log n) time or more than O(n) space.

read the original abstract

Colinear chaining is a classical heuristic for sequence alignment: it enables scalable genome comparison and is a main component of many state-of-the-art read mappers based on seed-chain-extend. The earliest $O(n \log \log n)$ time algorithms by Eppstein et al. (J. ACM, 1992) chained $n$ fragments between two sequences $T$ and $Q$ while minimizing a gap cost based on the diagonal distance $\Delta_{\text{diag}}$ between consecutive fragments. They also forbid fragment overlaps, which are essential in current chaining formulations: in long-read mapping, overlaps improve sensitivity and avoid restrictions on the fragment class considered. Jain, Gibney, and Thankachan (J. Comput. Biol. 2022) recently combined a $\Delta_{\text{diag}} = |\Delta_T -\Delta_Q|$ overlap cost with the classic $L_\infty = \max(\Delta_T , \Delta_Q)$ gap cost that takes the maximum between the horizontal and vertical gap between the fragments and they proved that chaining under this cost model is equivalent to the anchored edit distance. We improve the existing $O(n \log^3 n)$-time algorithm for anchored edit distance to $O(n \log \log n)$ time in $O(n)$ space, by combining the gap-cost computation of Chao and Miller (Algorithmica, 1995) with the overlap-cost computation of Baker and Giancarlo (ESA, 1998). By developing llchain, a simpler $O(n \log n)$-time implementation of our method, we show how chaining algorithms that might have been recently overlooked by the bioinformatics community scale competitively to millions of fragments and large genomes. On average, llchain is $10\times$ faster than other methods on instances with $3\,000\,000$ anchors, and over $3\times$ faster on MEMs between HiFi reads and a reference human genome.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript claims an O(n log log n)-time, O(n)-space algorithm for anchored edit distance (colinear chaining under L_infty gap cost and Delta_diag overlap cost) obtained by combining the gap-cost data structure of Chao and Miller (Algorithmica 1995) with the overlap-cost data structure of Baker and Giancarlo (ESA 1998). It also supplies an explicit simpler O(n log n) implementation called llchain, together with an analysis of a merged query procedure running in O(log log n) per fragment and empirical timings showing 3-10x speedups over prior methods on instances with up to 3 million anchors.

Significance. If the stated combination and merged-query analysis hold, the result revives overlooked classical O(n log log n) chaining techniques for a cost model directly relevant to modern seed-chain-extend pipelines in long-read mapping. The manuscript supplies an explicit algorithm description, direct complexity analysis of the simultaneous data-structure maintenance, and reproducible performance data on large genomes; these elements strengthen the contribution beyond the abstract claim.

minor comments (2)
  1. [Abstract] Abstract: the O(n log n) llchain variant is introduced only in the final sentence; a parenthetical note on its complexity would improve readability.
  2. Section 4 (or wherever the merged query is defined): the pseudocode or high-level description of how the two structures are maintained simultaneously could include an explicit statement of the per-fragment query time to make the O(log log n) bound immediately visible.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript, the recognition of its significance for modern seed-chain-extend pipelines, and the recommendation to accept. We are pleased that the explicit algorithm description, complexity analysis, and empirical results were viewed as strengthening the contribution.

Circularity Check

0 steps flagged

No significant circularity; derivation assembles independent external algorithms

full rationale

The paper's central claim is an algorithmic combination of the Chao-Miller (1995) gap-cost data structure and Baker-Giancarlo (1998) overlap-cost data structure, both from unrelated prior authors. The manuscript provides an explicit merged query procedure and direct O(log log n) per-fragment complexity analysis. No equations reduce to inputs by construction, no self-citations are load-bearing, and no ansatz or uniqueness result is smuggled from the authors' own prior work. The derivation is self-contained against the cited external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the correctness of two cited prior algorithms and standard assumptions about range-query data structures; no new free parameters or invented entities are introduced.

axioms (1)
  • standard math Correctness of the gap-cost computation in Chao and Miller (Algorithmica, 1995) and the overlap-cost computation in Baker and Giancarlo (ESA, 1998).
    The new algorithm is defined as their direct combination.

pith-pipeline@v0.9.1-grok · 5886 in / 1292 out tokens · 23531 ms · 2026-06-28T07:59:50.092712+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

  1. [1]

    URL: https://web.archive.org/web/20250708030638/https://www.iscb.org/cms_ addon/conferences/ismbeccb2004/short%20papers/19.pdf

    Short paper. URL: https://web.archive.org/web/20250708030638/https://www.iscb.org/cms_ addon/conferences/ismbeccb2004/short%20papers/19.pdf. 2 Mohamed Ibrahim Abouelhoda and Enno Ohlebusch. Chaining algorithms for multiple genome comparison.J. Discrete Algorithms, 3(2-4):321–341, 2005.doi:10.1016/J.JDA.2004.08.011. 3 Brenda S. Baker and Raffaele Giancarlo...

  2. [2]

    McCreight

    4 Rudolf Bayer and Edward M. McCreight. Organization and maintenance of large ordered indexes.Acta Informatica, 1(3):173–189, 1972.doi:10.1007/bf00288683. 5 Gary Benson, Avivit Levy, S. Maimoni, D. Noifeld, and B. Riva Shalom. LCSk: A refined similarity measure.Theor. Comput. Sci., 638:11–26, 2016.doi:10.1016/J.TCS.2015.11.026. 6 Gary Benson, Avivit Levy,...

  3. [3]

    12 Arthur L

    doi: 10.1007/978-3-540-77974-2. 12 Arthur L. Delcher, Simon Kasif, Robert D. Fleischmann, Jeremy Peterson, Owen White, and Steven L. Salzberg. Alignment of whole genomes.Nucleic Acids Research, 27(11):2369–2376, January 1999.doi:10.1093/nar/27.11.2369. 13 David Eppstein, Zvi Galil, Raffaele Giancarlo, and Giuseppe F. Italiano. Sparse dynamic programming I...

  4. [4]

    Guibas and Robert Sedgewick

    doi:10.1109/SFCS.1978.3. 17 Dan Gusfield.Algorithms on Strings, Trees, and Sequences - Computer Science and Computa- tional Biology. Cambridge University Press, 1997.doi:10.1017/CBO9780511574931. 18 Yijie Han. Deterministic sorting inO(nlog logn )time and linear space. In John H. Reif, editor,Proceedings on 34th Annual ACM Symposium on Theory of Computing...

  5. [5]

    20 Chirag Jain, Daniel Gibney, and Sharma V

    URL:https://www.bx.psu.edu/~rsharris/rsharris_phd_ thesis_2007.pdf. 20 Chirag Jain, Daniel Gibney, and Sharma V. Thankachan. Algorithms for colinear chaining with overlaps and gap costs.J. Comput. Biol., 29(11):1237–1251, 2022.doi:10.1089/CMB. 2022.0266. 21 Donald B. Johnson. A priority queue in which initialization and queue operations take O(log logD)ti...

  6. [6]

    22 Stefan Kurtz, Adam Phillippy, Arthur L

    doi: 10.1007/bf01786986. 22 Stefan Kurtz, Adam Phillippy, Arthur L. Delcher, Michael Smoot, Martin Shumway, Corina Antonescu, and Steven L. Salzberg. Versatile and open software for comparing large genomes. Genome biology, 5(2):R12, 2004.doi:10.1186/gb-2004-5-2-r12. 23 Heng Li. Minimap2: pairwise alignment for nucleotide sequences.Bioinform., 34(18):3094–...

  7. [7]

    and Mikheenko, Alla and Vollger, Mitchell R

    URL:http://dl.acm.org/citation.cfm?id=313651.313661. 29 Sergey Nurk, Sergey Koren, Arang Rhie, Mikko Rautiainen, Andrey V. Bzikadze, Alla Mikheenko, Mitchell R. Vollger, Nicolas Altemose, Lev Uralsky, Ariel Gershman, Sergey Aganezov, Savannah J. Hoyt, Mark Diekhans, Glennis A. Logsdon, Michael Alonge, Stylianos E. Antonarakis, Matthew Borchers, Gerard G. ...

  8. [8]

    33 Filip Pavetić, Ivan Katanić, Gustav Matula, Goran Žužić, and Mile Šikić

    doi: 10.1186/1748-7188-6-4. 33 Filip Pavetić, Ivan Katanić, Gustav Matula, Goran Žužić, and Mile Šikić. Fast and simple algorithms for computing bothLCS k and LCS k+.arXiv,

  9. [9]

    doi:10.48550/ARXIV.1705. 07279. 34 Filip Pavetić, Goran Žužić, and Mile Šikić.LCSk++: Practical similarity metric for long strings.arXiv, 2014.doi:10.48550/ARXIV.1407.2407. 35 Jingwen Ren and Mark J. P. Chaisson. lra: A long read aligner for sequences and contigs. PLoS Comput. Biol., 17(6), 2021.doi:10.1371/JOURNAL.PCBI.1009078. 36 Arang Rhie, Sergey Nurk...

  10. [10]

    39 Kristoffer Sahlin, Thomas Baudeau, Bastien Cazaux, and Camille Marchet

    arXiv:2506.11750, doi:10.48550/ARXIV.2506.11750. 39 Kristoffer Sahlin, Thomas Baudeau, Bastien Cazaux, and Camille Marchet. A survey of mapping algorithms in the long-reads era.Genome Biology, 24(1):133,

  11. [11]

    Accurate spliced alignment of long RNA sequencing reads

    40 Kristoffer Sahlin and Veli Mäkinen. Accurate spliced alignment of long RNA sequencing reads. Bioinform., 37(24):4643–4651, 2021.doi:10.1093/BIOINFORMATICS/BTAB540. 41 P. van Emde Boas. Preserving order in a forest in less than logarithmic time and linear space. Information Processing Letters, 6(3):80–82, June 1977.doi:10.1016/0020-0190(77)90031-x. 42 W...

  12. [12]

    44 Dan E. Willard. Log-logarithmic worst-case range queries are possible in spaceθ(n).Informa- tion Processing Letters, 17(2):81–84, August 1983.doi:10.1016/0020-0190(83)90075-3. 20 RevisitingO(nlog logn)Chaining for Anchored Edit Distance 45 Zheng Zhang, Balaji Raghavachari, Ross C. Hardison, and Webb Miller. Chaining multiple- alignment blocks.J. Comput...