pith. machine review for the scientific record. sign in

arxiv: 2605.04244 · v1 · submitted 2026-05-05 · 💻 cs.DS

Recognition: unknown

Faster Iterative φ Queries on the Positional BWT

Authors on Pith no claims yet

Pith reviewed 2026-05-08 17:31 UTC · model grok-4.3

classification 💻 cs.DS
keywords positional Burrows-Wheeler transformPBWTrefined segmentsphi querieshaplotype panelsspace-time tradeoffsrun-length compressioniterative predecessor queries
0
0 comments X

The pith

A decomposition of haplotypes into refined segments yields two improved space-time tradeoffs for k iterative predecessor queries on the positional Burrows-Wheeler transform.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a decomposition that splits each haplotype row into refined segments inside which the co-lexicographic predecessor haplotype stays the same. These segments are constructed so that any segment overlaps only a constant number of segments belonging to its predecessor haplotype and the total number of segments across all haplotypes stays linear in the number of runs plus the number of haplotypes. From this partition the authors derive two index structures that answer successive predecessor queries either faster at the same space cost or in less space at a comparable cost. The improvement is most noticeable when the number of haplotypes is much smaller than the number of sites, which matches current large-scale genomic panels. Such repeated predecessor steps are used to locate matching haplotype stretches and to detect variation.

Core claim

We introduce a decomposition scheme that divides each haplotype row of the PBWT into refined segments within which a haplotype's co-lexicographic predecessor remains unchanged at the end of the segment. These refined segments satisfy a constant-overlap property with the segments of their predecessor and have a total count bounded by O(r̃ + h). This allows two space-time tradeoffs: one using O((r̃ + h) log n) bits that answers k iterative φ queries in O(log log_w min(m,h) + k) time, and a more compact one using O(r̃ log h + h log n) bits that answers them in O(k log log_w h) time, improving on the prior O(k log log_w m) query time.

What carries the argument

The refined segments decomposition, which partitions each haplotype row into intervals of constant co-lexicographic predecessor and enforces constant overlap with the predecessor haplotype's segments.

If this is right

  • The first structure answers k iterative φ queries in O(log log_w min(m,h) + k) time using O((r̃ + h) log n) bits.
  • The second structure answers the same queries in O(k log log_w h) time using O(r̃ log h + h log n) bits.
  • Both structures improve on the earlier bound of O((r̃ + h) log n) bits and O(k log log_w m) time.
  • The second tradeoff becomes attractive for panels where the number of haplotypes h is much smaller than the number of sites m.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same constant-overlap decomposition may reduce the cost of other traversal queries that follow lexicographic order through the PBWT.
  • If the refined segments themselves can be computed in linear time, the new indexes could be built for panels that are currently too large for repeated query workloads.
  • The approach could be combined with existing run-length compression to lower memory further for population-scale genomic datasets.
  • Similar segment-based partitions might apply to predecessor queries on other compressed string structures that store multiple aligned sequences.

Load-bearing premise

Refined segments can be defined so that each overlaps only a constant number of segments from its predecessor haplotype while the total number of segments remains O(r̃ + h).

What would settle it

A haplotype panel in which every possible partition into constant-overlap predecessor-invariant intervals requires more than O(r̃ + h) segments in total.

read the original abstract

The Positional Burrows-Wheeler Transform (PBWT) is a fundamental data structure for the efficient representation and analysis of large-scale haplotype panels. For a panel of $h$ sequences $\{S_1, \dots, S_h\}$ over $m$ sites, a key operation is the $\phi_j(i)$ query, which returns the haplotype index immediately preceding $S_i$ in co-lexicographic order at site $j$. Efficient support for $k$ iterative queries $\phi^1, \dots, \phi^k$ is essential for haplotype matching and variation analysis. In this work, we introduce a simple and novel decomposition scheme that decomposes each haplotype row into sub-intervals, called refined segments, within which a haplotype's co-lexicographic predecessor for the sites remains unchanged. We show that refined segments satisfy two key properties: (i) each segment $[b,e]$ associated with $S_i$ overlaps with at most a constant number of segments of $S_{\phi_e(i)}$, and (ii) the total number of segments is bounded by $O(\tilde{r} + h)$, where $\tilde{r}$ denotes the number of runs in the PBWT. Building on this decomposition, we present two space-time tradeoffs for supporting $k$ iterative $\phi$ queries: (i) a structure using $O((\tilde{r} + h)\log n)$ bits of space that answers $k$ iterative queries in $O(\log \log_w \min(m,h) + k)$ time, where $n = m \cdot h$, and (ii) a more compact structure using $O(\tilde{r} \log h + h \log n)$ bits of space that supports queries in $O(k \log \log_w h)$ time. Prior to our work, supporting these queries required $O((\tilde{r} + h)\log n)$ bits of space and $O(k \cdot \log \log_w m)$ time. Our second tradeoff is expected to be effective in practice for modern genomic datasets, where the number $h$ of haplotypes is typically much smaller than the number $m$ of sites.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces a decomposition of each haplotype row in the Positional BWT into refined segments [b,e] such that (i) each overlaps at most a constant number of segments from the predecessor's row S_φ_e(i) and (ii) the global number of segments is O(r̃ + h). Building on these properties, it derives two space-time tradeoffs for k iterative φ queries: O((r̃ + h) log n) bits with O(log log_w min(m,h) + k) time, and O(r̃ log h + h log n) bits with O(k log log_w h) time, improving the prior O(k log log_w m) query time.

Significance. If the two properties of the refined-segment decomposition can be established, the work would yield practically relevant improvements for iterative φ queries, which support core haplotype-matching and variation-analysis tasks on large panels. The second, more compact tradeoff is noted as likely effective when h ≪ m, a common regime in modern genomic data. The approach uses standard stringology tools on run-length-encoded PBWTs and supplies explicit space-time expressions.

major comments (1)
  1. [decomposition into refined segments] The construction and proof that refined segments exist satisfying both the constant-overlap property and the O(r̃ + h) total-size bound simultaneously (abstract and the section defining the decomposition). Both claimed tradeoffs are derived directly from these two properties; if no such decomposition exists for arbitrary PBWT encodings, or if satisfying one forces the other to exceed its bound, the space-time results collapse. An explicit inductive construction or charging argument is required.
minor comments (2)
  1. [query-time analysis] Clarify the precise constant in the overlap property and how it propagates into the log-log factors in the query-time analysis.
  2. [preliminaries] Define r̃ on first use and state the relation n = m · h explicitly in the preliminaries.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and for recognizing the potential impact of the refined-segment approach on practical haplotype-matching tasks. We address the single major comment below.

read point-by-point responses
  1. Referee: [decomposition into refined segments] The construction and proof that refined segments exist satisfying both the constant-overlap property and the O(r̃ + h) total-size bound simultaneously (abstract and the section defining the decomposition). Both claimed tradeoffs are derived directly from these two properties; if no such decomposition exists for arbitrary PBWT encodings, or if satisfying one forces the other to exceed its bound, the space-time results collapse. An explicit inductive construction or charging argument is required.

    Authors: The manuscript already contains an explicit construction of the refined segments (Section 3) obtained by partitioning each haplotype row at every position where its co-lexicographic predecessor changes, using only the run-length encoding of the PBWT. The constant-overlap property (each segment overlaps at most two segments of its predecessor row) follows directly from the fact that the co-lex order is monotonic inside each run; crossings can occur only at run boundaries. The O(r̃ + h) cardinality bound is proved by a charging argument that assigns every refined segment either to a distinct run or to one of the h haplotype boundaries. We agree, however, that presenting the argument in fully inductive form (on the number of sites processed) would make the simultaneous satisfaction of both properties clearer. We will therefore expand Section 3 with an inductive construction and a detailed charging proof in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

Decomposition properties established independently; derivation self-contained

full rationale

The paper defines refined segments via an explicit construction on the PBWT run-length encoding and states that it proves the two load-bearing properties (constant overlap per segment and O(r̃ + h) global size) directly from the PBWT structure. These properties then yield the two space bounds and query times without any reduction to fitted parameters, self-referential definitions, or load-bearing self-citations. No step equates a claimed prediction or uniqueness result to its own inputs by construction; the central existence claim is supported by a separate proof rather than being presupposed by the final data-structure bounds.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claims rest on the existence of a decomposition into refined segments that obeys the constant-overlap and O(r̃ + h) size properties; these are domain-specific assumptions introduced and asserted in the work.

axioms (2)
  • domain assumption Refined segments exist such that each overlaps at most a constant number of segments belonging to its co-lexicographic predecessor at the right endpoint.
    This overlap property is invoked to control the space and query time of the supporting structures.
  • domain assumption The total number of refined segments across all haplotypes is O(r̃ + h).
    This size bound is used to derive the space complexity of both presented structures.
invented entities (1)
  • refined segments no independent evidence
    purpose: Sub-intervals of each haplotype row inside which the co-lexicographic predecessor remains constant.
    Newly defined objects that enable the overlap and size bounds used for the query data structures.

pith-pipeline@v0.9.0 · 5712 in / 1477 out tokens · 54187 ms · 2026-05-08T17:31:52.522941+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

52 extracted references · 18 canonical work pages · 1 internal anchor

  1. [1]

    Nature , volume=

    A global reference for human genetic variation , author=. Nature , volume=

  2. [2]

    Fredman and Dan E

    Michael L. Fredman and Dan E. Willard , title =. J. Comput. Syst. Sci. , volume =. 1993 , doi =

  3. [3]

    Move-r: Optimizing the r-index , booktitle =

    Nico Bertram and Johannes Fischer and Lukas Nalbach , editor =. Move-r: Optimizing the r-index , booktitle =. 2024 , doi =

  4. [4]

    Brown and Ahsan Sanaullah and Shaojie Zhang and Ben Langmead , title =

    Nathaniel K. Brown and Ahsan Sanaullah and Shaojie Zhang and Ben Langmead , title =. CoRR , volume =. 2026 , doi =

  5. [5]

    An Efficient Data Structure and Algorithm for Long-Match Query in Run-Length Compressed

    Ahsan Sanaullah and Degui Zhi and Shaojie Zhang , editor =. An Efficient Data Structure and Algorithm for Long-Match Query in Run-Length Compressed. 25th International Conference on Algorithms for Bioinformatics,. 2025 , doi =

  6. [6]

    Brown and Travis Gagie and Massimiliano Rossi , editor =

    Nathaniel K. Brown and Travis Gagie and Massimiliano Rossi , editor =. 20th International Symposium on Experimental Algorithms,. 2022 , doi =

  7. [7]

    Proceedings of the Nine Workshop on Algorithm Engineering and Experiments,

    Daisuke Okanohara and Kunihiko Sadakane , title =. Proceedings of the Nine Workshop on Algorithm Engineering and Experiments,. 2007 , doi =

  8. [8]

    37th Annual Symposium on Combinatorial Pattern Matching,

    Paola Bonizzoni and Davide Cozzi and Younan Gao , title =. 37th Annual Symposium on Combinatorial Pattern Matching,

  9. [9]

    Guibas , title =

    Bernard Chazelle and Leonidas J. Guibas , title =. Algorithmica , volume =. 1986 , doi =

  10. [10]

    , title =

    Burrows, Michael and Wheeler, David J. , title =. 1994 , number =

  11. [11]

    Myers , title =

    Udi Manber and Eugene W. Myers , title =. 1993 , doi =

  12. [12]

    Permuted Longest-Common-Prefix Array , booktitle =

    Juha K. Permuted Longest-Common-Prefix Array , booktitle =. 2009 , doi =

  13. [13]

    41st Annual Symposium on Foundations of Computer Science,

    Paolo Ferragina and Giovanni Manzini , title =. 41st Annual Symposium on Foundations of Computer Science,. 2000 , doi =

  14. [14]

    2019 , doi =

    Ardalan Naseri and Degui Zhi and Shaojie Zhang , title =. 2019 , doi =

  15. [16]

    48th International Colloquium on Automata, Languages, and Programming,

    Takaaki Nishimoto and Yasuo Tabei , editor =. 48th International Colloquium on Automata, Languages, and Programming,. 2021 , doi =

  16. [17]

    Journal of the

    Travis Gagie and Gonzalo Navarro and Nicola Prezza , title =. Journal of the. 2020 , doi =

  17. [18]

    Bioinform

    Richard Durbin , title =. Bioinform. , volume =. 2014 , doi =

  18. [19]

    Bioinform

    Davide Cozzi and Massimiliano Rossi and Simone Rubinacci and Travis Gagie and Dominik K. Bioinform. , volume =. 2023 , doi =

  19. [21]

    Morrison , title =

    Donald R. Morrison , title =. Journal of the. 1968 , url =

  20. [22]

    Myers , title =

    Udi Manber and Eugene W. Myers , title =. 1993 , url =

  21. [23]

    Smallest suffixient sets as a repetitiveness measure

    Navarro, Gonzalo and Romana, Giuseppe and Urbina, Cristian. Smallest suffixient sets as a repetitiveness measure. Lecture Notes in Computer Science

  22. [24]

    On compressing and indexing repetitive sequences

    Kreft, Sebastian and Navarro, Gonzalo. On compressing and indexing repetitive sequences. Theor. Comput. Sci

  23. [25]

    An upper bound and linear-space queries on the LZ-end parsing

    Kempa, Dominik and Saha, Barna. An upper bound and linear-space queries on the LZ-end parsing. Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms ( SODA )

  24. [26]

    Indexing highly repetitive string collections

    Navarro, Gonzalo. Indexing highly repetitive string collections. 2004.02781

  25. [27]

    Grammar-based codes: a new class of universal lossless source codes

    Kieffer, J C and Yang, En-Hui. Grammar-based codes: a new class of universal lossless source codes. IEEE Trans. Inf. Theory

  26. [28]

    Self-indexed text compression using straight-line programs

    Claude, Francisco and Navarro, Gonzalo. Self-indexed text compression using straight-line programs. Mathematical Foundations of Computer Science 2009

  27. [29]

    Random access to grammar-compressed strings and trees

    Bille, Philip and Landau, Gad M and Raman, Rajeev and Sadakane, Kunihiko and Satti, Srinivasa Rao and Weimann, Oren. Random access to grammar-compressed strings and trees. SIAM J. Comput

  28. [30]

    Fully dynamic data structure for LCE queries in compressed space

    Nishimoto, Takaaki and I, Tomohiro and Inenaga, Shunsuke and Bannai, Hideo and Takeda, Masayuki. Fully dynamic data structure for LCE queries in compressed space

  29. [31]

    Optimal-time dictionary-compressed indexes

    Christiansen, Anders Roy and Ettienne, Mikko Berggren and Kociumaka, Tomasz and Navarro, Gonzalo and Prezza, Nicola. Optimal-time dictionary-compressed indexes. ACM T ransactions on A lgorithms

  30. [32]

    2022 , publisher=

    Halldorsson, Bjarni V and Eggertsson, Hannes P and Moore, Kristjan HS and Hauswedell, Hannes and Eiriksson, Ogmundur and Ulfarsson, Magnus O and Palsson, Gunnar and Hardarson, Marteinn T and Oddsson, Asmundur and Jensson, Brynjar O and others , journal=. 2022 , publisher=

  31. [33]

    CoRR , volume =

    Travis Gagie and Giovanni Manzini and Marinella Sciortino , title =. CoRR , volume =. 2022 , doi =

  32. [34]

    String Processing and Information Retrieval - 30th International Symposium,

    Paola Bonizzoni and Christina Boucher and Davide Cozzi and Travis Gagie and Dominik K. String Processing and Information Retrieval - 30th International Symposium,. 2023 , url =. doi:10.1007/978-3-031-43980-3\_8 , timestamp =

  33. [35]

    22nd International Workshop on Algorithms in Bioinformatics,

    Ahsan Sanaullah and Degui Zhi and Shaoije Zhang , editor =. 22nd International Workshop on Algorithms in Bioinformatics,. 2022 , doi =

  34. [36]

    Optimal Lower and Upper Bounds for Representing Sequences

    Djamal Belazzougui and Gonzalo Navarro. Optimal Lower and Upper Bounds for Representing Sequences . ACM Transactions on Algorithms , 11(4), April 2015. https://doi.org/10.1145/2629339 doi:10.1145/2629339

  35. [37]

    Move-r: Optimizing the r-index

    Nico Bertram, Johannes Fischer, and Lukas Nalbach. Move-r: Optimizing the r-index. In Leo Liberti, editor, 22nd International Symposium on Experimental Algorithms, SEA 2024, Vienna, Austria, July 23-26, 2024 , LIPIcs, pages 1:1--1:19. Schloss Dagstuhl - Leibniz-Zentrum f \" u r Informatik, 2024. https://doi.org/10.4230/LIPICS.SEA.2024.1 doi:10.4230/LIPICS...

  36. [38]

    Solving the Minimal Positional Substring Cover Problem in Sublinear Space

    Paola Bonizzoni, Christina Boucher, Davide Cozzi, Travis Gagie, and Yuri Pirola. Solving the Minimal Positional Substring Cover Problem in Sublinear Space . In Shunsuke Inenaga and Simon J. Puglisi, editors, 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024) , volume 296 of LIPIcs , pages 12:1--12:16, Dagstuhl, Germany, 2024. Schloss Dagst...

  37. [39]

    Optimal-Time Mapping in Run-Length Compressed PBWT

    Paola Bonizzoni, Davide Cozzi, and Younan Gao. Optimal-Time Mapping in Run-Length Compressed PBWT . In Philip Bille and Nicola Prezza, editors, 37th Annual Symposium on Combinatorial Pattern Matching, CPM 2026 (to appear) , LIPIcs. Schloss Dagstuhl - Leibniz-Zentrum f \" u r Informatik, 2026

  38. [40]

    Brown, Travis Gagie, and Massimiliano Rossi

    Nathaniel K. Brown, Travis Gagie, and Massimiliano Rossi. RLBWT tricks. In Christian Schulz and Bora U c ar, editors, 20th International Symposium on Experimental Algorithms, SEA 2022, Heidelberg, Germany, July 25-27, 2022 , LIPIcs, pages 16:1--16:16. Schloss Dagstuhl - Leibniz-Zentrum f \" u r Informatik, 2022. https://doi.org/10.4230/LIPICS.SEA.2022.16 ...

  39. [41]

    Optimal-Time Move Structure Construction

    Nathaniel K. Brown, Ahsan Sanaullah, Shaojie Zhang, and Ben Langmead. Optimal-time move structure balancing and LCP array computation from the RLBWT . CoRR , abs/2603.22147, 2026. https://doi.org/10.48550/ARXIV.2603.22147 doi:10.48550/ARXIV.2603.22147

  40. [42]

    Michael Burrows and David J. Wheeler. A Block-sorting Lossless Data Compression Algorithm . Technical Report SRC-TR-124, Digital Equipment Corporation, Palo Alto, CA, USA, May 1994

  41. [43]

    Bernard Chazelle and Leonidas J. Guibas. Fractional cascading: I. A data structuring technique. Algorithmica , 1(2):133--162, 1986. https://doi.org/10.1007/BF01840440 doi:10.1007/BF01840440

  42. [44]

    A global reference for human genetic variation

    1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature , 526(7571):68, 2015

  43. [45]

    \( \) - PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data

    Davide Cozzi, Massimiliano Rossi, Simone Rubinacci, Travis Gagie, Dominik K \" o ppl, Christina Boucher, and Paola Bonizzoni. \( \) - PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data . Bioinform. , 39(9), 2023. https://doi.org/10.1093/BIOINFORMATICS/BTAD552 doi:10.1093/BIOINFORMATICS/BTAD552

  44. [46]

    Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT)

    Richard Durbin. Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT) . Bioinform. , 30(9):1266--1272, 2014. https://doi.org/10.1093/BIOINFORMATICS/BTU014 doi:10.1093/BIOINFORMATICS/BTU014

  45. [47]

    Opportunistic data structures with applications

    Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In 41st Annual Symposium on Foundations of Computer Science, FOCS 2000, Redondo Beach, California, USA, November 12-14, 2000 , pages 390--398. IEEE Computer Society, 2000. https://doi.org/10.1109/SFCS.2000.892127 doi:10.1109/SFCS.2000.892127

  46. [48]

    Fredman and Dan E

    Michael L. Fredman and Dan E. Willard. Surpassing the information theoretic bound with fusion trees. J. Comput. Syst. Sci. , 47(3):424--436, 1993. https://doi.org/10.1016/0022-0000(93)90040-4 doi:10.1016/0022-0000(93)90040-4

  47. [49]

    15 Travis Gagie, Gonzalo Navarro, and Simon J

    Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space . Journal of the ACM , 67(1):2:1--2:54, 2020. https://doi.org/10.1145/3375890 doi:10.1145/3375890

  48. [50]

    The sequences of 150,119 genomes in the UK Biobank

    Bjarni V Halldorsson, Hannes P Eggertsson, Kristjan HS Moore, Hannes Hauswedell, Ogmundur Eiriksson, Magnus O Ulfarsson, Gunnar Palsson, Marteinn T Hardarson, Asmundur Oddsson, Brynjar O Jensson, et al. The sequences of 150,119 genomes in the UK Biobank . Nature , pages 1--9, 2022. https://doi.org/10.1038/s41586-022-04965-x doi:10.1038/s41586-022-04965-x

  49. [51]

    a rkk \

    Juha K \" a rkk \" a inen, Giovanni Manzini, and Simon J. Puglisi. Permuted longest-common-prefix array. In Gregory Kucherov and Esko Ukkonen, editors, Combinatorial Pattern Matching, 20th Annual Symposium, CPM 2009, Lille, France, June 22-24, 2009, Proceedings , Lecture Notes in Computer Science, pages 181--192. Springer, 2009. https://doi.org/10.1007/97...

  50. [52]

    Udi Manber and Eugene W. Myers. Suffix Arrays: A New Method for On-Line String Searches . SIAM J. Comput. , 22(5):935--948, 1993. https://doi.org/10.1137/0222058 doi:10.1137/0222058

  51. [53]

    Optimal-Time Queries on BWT-Runs Compressed Indexes

    Takaaki Nishimoto and Yasuo Tabei. Optimal-Time Queries on BWT-Runs Compressed Indexes . In Nikhil Bansal, Emanuela Merelli, and James Worrell, editors, 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021, Glasgow, Scotland (Virtual Conference), July 12-16, 2021 , LIPIcs, pages 101:1--101:15. Schloss Dagstuhl - Leibniz-Zentru...

  52. [54]

    Practical Entropy-Compressed Rank/Select Dictionary

    Daisuke Okanohara and Kunihiko Sadakane. Practical Entropy-Compressed Rank/Select Dictionary . In Proceedings of the Nine Workshop on Algorithm Engineering and Experiments, ALENEX 2007, New Orleans, Louisiana, USA, January 6, 2007 . SIAM , 2007. https://doi.org/10.1137/1.9781611972870.6 doi:10.1137/1.9781611972870.6