Recognition: unknown
Faster Iterative φ Queries on the Positional BWT
Pith reviewed 2026-05-08 17:31 UTC · model grok-4.3
The pith
A decomposition of haplotypes into refined segments yields two improved space-time tradeoffs for k iterative predecessor queries on the positional Burrows-Wheeler transform.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a decomposition scheme that divides each haplotype row of the PBWT into refined segments within which a haplotype's co-lexicographic predecessor remains unchanged at the end of the segment. These refined segments satisfy a constant-overlap property with the segments of their predecessor and have a total count bounded by O(r̃ + h). This allows two space-time tradeoffs: one using O((r̃ + h) log n) bits that answers k iterative φ queries in O(log log_w min(m,h) + k) time, and a more compact one using O(r̃ log h + h log n) bits that answers them in O(k log log_w h) time, improving on the prior O(k log log_w m) query time.
What carries the argument
The refined segments decomposition, which partitions each haplotype row into intervals of constant co-lexicographic predecessor and enforces constant overlap with the predecessor haplotype's segments.
If this is right
- The first structure answers k iterative φ queries in O(log log_w min(m,h) + k) time using O((r̃ + h) log n) bits.
- The second structure answers the same queries in O(k log log_w h) time using O(r̃ log h + h log n) bits.
- Both structures improve on the earlier bound of O((r̃ + h) log n) bits and O(k log log_w m) time.
- The second tradeoff becomes attractive for panels where the number of haplotypes h is much smaller than the number of sites m.
Where Pith is reading between the lines
- The same constant-overlap decomposition may reduce the cost of other traversal queries that follow lexicographic order through the PBWT.
- If the refined segments themselves can be computed in linear time, the new indexes could be built for panels that are currently too large for repeated query workloads.
- The approach could be combined with existing run-length compression to lower memory further for population-scale genomic datasets.
- Similar segment-based partitions might apply to predecessor queries on other compressed string structures that store multiple aligned sequences.
Load-bearing premise
Refined segments can be defined so that each overlaps only a constant number of segments from its predecessor haplotype while the total number of segments remains O(r̃ + h).
What would settle it
A haplotype panel in which every possible partition into constant-overlap predecessor-invariant intervals requires more than O(r̃ + h) segments in total.
read the original abstract
The Positional Burrows-Wheeler Transform (PBWT) is a fundamental data structure for the efficient representation and analysis of large-scale haplotype panels. For a panel of $h$ sequences $\{S_1, \dots, S_h\}$ over $m$ sites, a key operation is the $\phi_j(i)$ query, which returns the haplotype index immediately preceding $S_i$ in co-lexicographic order at site $j$. Efficient support for $k$ iterative queries $\phi^1, \dots, \phi^k$ is essential for haplotype matching and variation analysis. In this work, we introduce a simple and novel decomposition scheme that decomposes each haplotype row into sub-intervals, called refined segments, within which a haplotype's co-lexicographic predecessor for the sites remains unchanged. We show that refined segments satisfy two key properties: (i) each segment $[b,e]$ associated with $S_i$ overlaps with at most a constant number of segments of $S_{\phi_e(i)}$, and (ii) the total number of segments is bounded by $O(\tilde{r} + h)$, where $\tilde{r}$ denotes the number of runs in the PBWT. Building on this decomposition, we present two space-time tradeoffs for supporting $k$ iterative $\phi$ queries: (i) a structure using $O((\tilde{r} + h)\log n)$ bits of space that answers $k$ iterative queries in $O(\log \log_w \min(m,h) + k)$ time, where $n = m \cdot h$, and (ii) a more compact structure using $O(\tilde{r} \log h + h \log n)$ bits of space that supports queries in $O(k \log \log_w h)$ time. Prior to our work, supporting these queries required $O((\tilde{r} + h)\log n)$ bits of space and $O(k \cdot \log \log_w m)$ time. Our second tradeoff is expected to be effective in practice for modern genomic datasets, where the number $h$ of haplotypes is typically much smaller than the number $m$ of sites.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a decomposition of each haplotype row in the Positional BWT into refined segments [b,e] such that (i) each overlaps at most a constant number of segments from the predecessor's row S_φ_e(i) and (ii) the global number of segments is O(r̃ + h). Building on these properties, it derives two space-time tradeoffs for k iterative φ queries: O((r̃ + h) log n) bits with O(log log_w min(m,h) + k) time, and O(r̃ log h + h log n) bits with O(k log log_w h) time, improving the prior O(k log log_w m) query time.
Significance. If the two properties of the refined-segment decomposition can be established, the work would yield practically relevant improvements for iterative φ queries, which support core haplotype-matching and variation-analysis tasks on large panels. The second, more compact tradeoff is noted as likely effective when h ≪ m, a common regime in modern genomic data. The approach uses standard stringology tools on run-length-encoded PBWTs and supplies explicit space-time expressions.
major comments (1)
- [decomposition into refined segments] The construction and proof that refined segments exist satisfying both the constant-overlap property and the O(r̃ + h) total-size bound simultaneously (abstract and the section defining the decomposition). Both claimed tradeoffs are derived directly from these two properties; if no such decomposition exists for arbitrary PBWT encodings, or if satisfying one forces the other to exceed its bound, the space-time results collapse. An explicit inductive construction or charging argument is required.
minor comments (2)
- [query-time analysis] Clarify the precise constant in the overlap property and how it propagates into the log-log factors in the query-time analysis.
- [preliminaries] Define r̃ on first use and state the relation n = m · h explicitly in the preliminaries.
Simulated Author's Rebuttal
We thank the referee for the careful review and for recognizing the potential impact of the refined-segment approach on practical haplotype-matching tasks. We address the single major comment below.
read point-by-point responses
-
Referee: [decomposition into refined segments] The construction and proof that refined segments exist satisfying both the constant-overlap property and the O(r̃ + h) total-size bound simultaneously (abstract and the section defining the decomposition). Both claimed tradeoffs are derived directly from these two properties; if no such decomposition exists for arbitrary PBWT encodings, or if satisfying one forces the other to exceed its bound, the space-time results collapse. An explicit inductive construction or charging argument is required.
Authors: The manuscript already contains an explicit construction of the refined segments (Section 3) obtained by partitioning each haplotype row at every position where its co-lexicographic predecessor changes, using only the run-length encoding of the PBWT. The constant-overlap property (each segment overlaps at most two segments of its predecessor row) follows directly from the fact that the co-lex order is monotonic inside each run; crossings can occur only at run boundaries. The O(r̃ + h) cardinality bound is proved by a charging argument that assigns every refined segment either to a distinct run or to one of the h haplotype boundaries. We agree, however, that presenting the argument in fully inductive form (on the number of sites processed) would make the simultaneous satisfaction of both properties clearer. We will therefore expand Section 3 with an inductive construction and a detailed charging proof in the revised manuscript. revision: yes
Circularity Check
Decomposition properties established independently; derivation self-contained
full rationale
The paper defines refined segments via an explicit construction on the PBWT run-length encoding and states that it proves the two load-bearing properties (constant overlap per segment and O(r̃ + h) global size) directly from the PBWT structure. These properties then yield the two space bounds and query times without any reduction to fitted parameters, self-referential definitions, or load-bearing self-citations. No step equates a claimed prediction or uniqueness result to its own inputs by construction; the central existence claim is supported by a separate proof rather than being presupposed by the final data-structure bounds.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Refined segments exist such that each overlaps at most a constant number of segments belonging to its co-lexicographic predecessor at the right endpoint.
- domain assumption The total number of refined segments across all haplotypes is O(r̃ + h).
invented entities (1)
-
refined segments
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Nature , volume=
A global reference for human genetic variation , author=. Nature , volume=
-
[2]
Fredman and Dan E
Michael L. Fredman and Dan E. Willard , title =. J. Comput. Syst. Sci. , volume =. 1993 , doi =
1993
-
[3]
Move-r: Optimizing the r-index , booktitle =
Nico Bertram and Johannes Fischer and Lukas Nalbach , editor =. Move-r: Optimizing the r-index , booktitle =. 2024 , doi =
2024
-
[4]
Brown and Ahsan Sanaullah and Shaojie Zhang and Ben Langmead , title =
Nathaniel K. Brown and Ahsan Sanaullah and Shaojie Zhang and Ben Langmead , title =. CoRR , volume =. 2026 , doi =
2026
-
[5]
An Efficient Data Structure and Algorithm for Long-Match Query in Run-Length Compressed
Ahsan Sanaullah and Degui Zhi and Shaojie Zhang , editor =. An Efficient Data Structure and Algorithm for Long-Match Query in Run-Length Compressed. 25th International Conference on Algorithms for Bioinformatics,. 2025 , doi =
2025
-
[6]
Brown and Travis Gagie and Massimiliano Rossi , editor =
Nathaniel K. Brown and Travis Gagie and Massimiliano Rossi , editor =. 20th International Symposium on Experimental Algorithms,. 2022 , doi =
2022
-
[7]
Proceedings of the Nine Workshop on Algorithm Engineering and Experiments,
Daisuke Okanohara and Kunihiko Sadakane , title =. Proceedings of the Nine Workshop on Algorithm Engineering and Experiments,. 2007 , doi =
2007
-
[8]
37th Annual Symposium on Combinatorial Pattern Matching,
Paola Bonizzoni and Davide Cozzi and Younan Gao , title =. 37th Annual Symposium on Combinatorial Pattern Matching,
-
[9]
Guibas , title =
Bernard Chazelle and Leonidas J. Guibas , title =. Algorithmica , volume =. 1986 , doi =
1986
-
[10]
, title =
Burrows, Michael and Wheeler, David J. , title =. 1994 , number =
1994
-
[11]
Myers , title =
Udi Manber and Eugene W. Myers , title =. 1993 , doi =
1993
-
[12]
Permuted Longest-Common-Prefix Array , booktitle =
Juha K. Permuted Longest-Common-Prefix Array , booktitle =. 2009 , doi =
2009
-
[13]
41st Annual Symposium on Foundations of Computer Science,
Paolo Ferragina and Giovanni Manzini , title =. 41st Annual Symposium on Foundations of Computer Science,. 2000 , doi =
2000
-
[14]
2019 , doi =
Ardalan Naseri and Degui Zhi and Shaojie Zhang , title =. 2019 , doi =
2019
-
[16]
48th International Colloquium on Automata, Languages, and Programming,
Takaaki Nishimoto and Yasuo Tabei , editor =. 48th International Colloquium on Automata, Languages, and Programming,. 2021 , doi =
2021
-
[17]
Journal of the
Travis Gagie and Gonzalo Navarro and Nicola Prezza , title =. Journal of the. 2020 , doi =
2020
-
[18]
Bioinform
Richard Durbin , title =. Bioinform. , volume =. 2014 , doi =
2014
-
[19]
Bioinform
Davide Cozzi and Massimiliano Rossi and Simone Rubinacci and Travis Gagie and Dominik K. Bioinform. , volume =. 2023 , doi =
2023
-
[21]
Morrison , title =
Donald R. Morrison , title =. Journal of the. 1968 , url =
1968
-
[22]
Myers , title =
Udi Manber and Eugene W. Myers , title =. 1993 , url =
1993
-
[23]
Smallest suffixient sets as a repetitiveness measure
Navarro, Gonzalo and Romana, Giuseppe and Urbina, Cristian. Smallest suffixient sets as a repetitiveness measure. Lecture Notes in Computer Science
-
[24]
On compressing and indexing repetitive sequences
Kreft, Sebastian and Navarro, Gonzalo. On compressing and indexing repetitive sequences. Theor. Comput. Sci
-
[25]
An upper bound and linear-space queries on the LZ-end parsing
Kempa, Dominik and Saha, Barna. An upper bound and linear-space queries on the LZ-end parsing. Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms ( SODA )
2022
-
[26]
Indexing highly repetitive string collections
Navarro, Gonzalo. Indexing highly repetitive string collections. 2004.02781
-
[27]
Grammar-based codes: a new class of universal lossless source codes
Kieffer, J C and Yang, En-Hui. Grammar-based codes: a new class of universal lossless source codes. IEEE Trans. Inf. Theory
-
[28]
Self-indexed text compression using straight-line programs
Claude, Francisco and Navarro, Gonzalo. Self-indexed text compression using straight-line programs. Mathematical Foundations of Computer Science 2009
2009
-
[29]
Random access to grammar-compressed strings and trees
Bille, Philip and Landau, Gad M and Raman, Rajeev and Sadakane, Kunihiko and Satti, Srinivasa Rao and Weimann, Oren. Random access to grammar-compressed strings and trees. SIAM J. Comput
-
[30]
Fully dynamic data structure for LCE queries in compressed space
Nishimoto, Takaaki and I, Tomohiro and Inenaga, Shunsuke and Bannai, Hideo and Takeda, Masayuki. Fully dynamic data structure for LCE queries in compressed space
-
[31]
Optimal-time dictionary-compressed indexes
Christiansen, Anders Roy and Ettienne, Mikko Berggren and Kociumaka, Tomasz and Navarro, Gonzalo and Prezza, Nicola. Optimal-time dictionary-compressed indexes. ACM T ransactions on A lgorithms
-
[32]
2022 , publisher=
Halldorsson, Bjarni V and Eggertsson, Hannes P and Moore, Kristjan HS and Hauswedell, Hannes and Eiriksson, Ogmundur and Ulfarsson, Magnus O and Palsson, Gunnar and Hardarson, Marteinn T and Oddsson, Asmundur and Jensson, Brynjar O and others , journal=. 2022 , publisher=
2022
-
[33]
CoRR , volume =
Travis Gagie and Giovanni Manzini and Marinella Sciortino , title =. CoRR , volume =. 2022 , doi =
2022
-
[34]
String Processing and Information Retrieval - 30th International Symposium,
Paola Bonizzoni and Christina Boucher and Davide Cozzi and Travis Gagie and Dominik K. String Processing and Information Retrieval - 30th International Symposium,. 2023 , url =. doi:10.1007/978-3-031-43980-3\_8 , timestamp =
-
[35]
22nd International Workshop on Algorithms in Bioinformatics,
Ahsan Sanaullah and Degui Zhi and Shaoije Zhang , editor =. 22nd International Workshop on Algorithms in Bioinformatics,. 2022 , doi =
2022
-
[36]
Optimal Lower and Upper Bounds for Representing Sequences
Djamal Belazzougui and Gonzalo Navarro. Optimal Lower and Upper Bounds for Representing Sequences . ACM Transactions on Algorithms , 11(4), April 2015. https://doi.org/10.1145/2629339 doi:10.1145/2629339
-
[37]
Move-r: Optimizing the r-index
Nico Bertram, Johannes Fischer, and Lukas Nalbach. Move-r: Optimizing the r-index. In Leo Liberti, editor, 22nd International Symposium on Experimental Algorithms, SEA 2024, Vienna, Austria, July 23-26, 2024 , LIPIcs, pages 1:1--1:19. Schloss Dagstuhl - Leibniz-Zentrum f \" u r Informatik, 2024. https://doi.org/10.4230/LIPICS.SEA.2024.1 doi:10.4230/LIPICS...
-
[38]
Solving the Minimal Positional Substring Cover Problem in Sublinear Space
Paola Bonizzoni, Christina Boucher, Davide Cozzi, Travis Gagie, and Yuri Pirola. Solving the Minimal Positional Substring Cover Problem in Sublinear Space . In Shunsuke Inenaga and Simon J. Puglisi, editors, 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024) , volume 296 of LIPIcs , pages 12:1--12:16, Dagstuhl, Germany, 2024. Schloss Dagst...
-
[39]
Optimal-Time Mapping in Run-Length Compressed PBWT
Paola Bonizzoni, Davide Cozzi, and Younan Gao. Optimal-Time Mapping in Run-Length Compressed PBWT . In Philip Bille and Nicola Prezza, editors, 37th Annual Symposium on Combinatorial Pattern Matching, CPM 2026 (to appear) , LIPIcs. Schloss Dagstuhl - Leibniz-Zentrum f \" u r Informatik, 2026
2026
-
[40]
Brown, Travis Gagie, and Massimiliano Rossi
Nathaniel K. Brown, Travis Gagie, and Massimiliano Rossi. RLBWT tricks. In Christian Schulz and Bora U c ar, editors, 20th International Symposium on Experimental Algorithms, SEA 2022, Heidelberg, Germany, July 25-27, 2022 , LIPIcs, pages 16:1--16:16. Schloss Dagstuhl - Leibniz-Zentrum f \" u r Informatik, 2022. https://doi.org/10.4230/LIPICS.SEA.2022.16 ...
-
[41]
Optimal-Time Move Structure Construction
Nathaniel K. Brown, Ahsan Sanaullah, Shaojie Zhang, and Ben Langmead. Optimal-time move structure balancing and LCP array computation from the RLBWT . CoRR , abs/2603.22147, 2026. https://doi.org/10.48550/ARXIV.2603.22147 doi:10.48550/ARXIV.2603.22147
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2603.22147 2026
-
[42]
Michael Burrows and David J. Wheeler. A Block-sorting Lossless Data Compression Algorithm . Technical Report SRC-TR-124, Digital Equipment Corporation, Palo Alto, CA, USA, May 1994
1994
-
[43]
Bernard Chazelle and Leonidas J. Guibas. Fractional cascading: I. A data structuring technique. Algorithmica , 1(2):133--162, 1986. https://doi.org/10.1007/BF01840440 doi:10.1007/BF01840440
-
[44]
A global reference for human genetic variation
1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature , 526(7571):68, 2015
2015
-
[45]
\( \) - PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data
Davide Cozzi, Massimiliano Rossi, Simone Rubinacci, Travis Gagie, Dominik K \" o ppl, Christina Boucher, and Paola Bonizzoni. \( \) - PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data . Bioinform. , 39(9), 2023. https://doi.org/10.1093/BIOINFORMATICS/BTAD552 doi:10.1093/BIOINFORMATICS/BTAD552
-
[46]
Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT)
Richard Durbin. Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT) . Bioinform. , 30(9):1266--1272, 2014. https://doi.org/10.1093/BIOINFORMATICS/BTU014 doi:10.1093/BIOINFORMATICS/BTU014
-
[47]
Opportunistic data structures with applications
Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In 41st Annual Symposium on Foundations of Computer Science, FOCS 2000, Redondo Beach, California, USA, November 12-14, 2000 , pages 390--398. IEEE Computer Society, 2000. https://doi.org/10.1109/SFCS.2000.892127 doi:10.1109/SFCS.2000.892127
-
[48]
Michael L. Fredman and Dan E. Willard. Surpassing the information theoretic bound with fusion trees. J. Comput. Syst. Sci. , 47(3):424--436, 1993. https://doi.org/10.1016/0022-0000(93)90040-4 doi:10.1016/0022-0000(93)90040-4
-
[49]
15 Travis Gagie, Gonzalo Navarro, and Simon J
Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space . Journal of the ACM , 67(1):2:1--2:54, 2020. https://doi.org/10.1145/3375890 doi:10.1145/3375890
-
[50]
The sequences of 150,119 genomes in the UK Biobank
Bjarni V Halldorsson, Hannes P Eggertsson, Kristjan HS Moore, Hannes Hauswedell, Ogmundur Eiriksson, Magnus O Ulfarsson, Gunnar Palsson, Marteinn T Hardarson, Asmundur Oddsson, Brynjar O Jensson, et al. The sequences of 150,119 genomes in the UK Biobank . Nature , pages 1--9, 2022. https://doi.org/10.1038/s41586-022-04965-x doi:10.1038/s41586-022-04965-x
-
[51]
Juha K \" a rkk \" a inen, Giovanni Manzini, and Simon J. Puglisi. Permuted longest-common-prefix array. In Gregory Kucherov and Esko Ukkonen, editors, Combinatorial Pattern Matching, 20th Annual Symposium, CPM 2009, Lille, France, June 22-24, 2009, Proceedings , Lecture Notes in Computer Science, pages 181--192. Springer, 2009. https://doi.org/10.1007/97...
-
[52]
Udi Manber and Eugene W. Myers. Suffix Arrays: A New Method for On-Line String Searches . SIAM J. Comput. , 22(5):935--948, 1993. https://doi.org/10.1137/0222058 doi:10.1137/0222058
-
[53]
Optimal-Time Queries on BWT-Runs Compressed Indexes
Takaaki Nishimoto and Yasuo Tabei. Optimal-Time Queries on BWT-Runs Compressed Indexes . In Nikhil Bansal, Emanuela Merelli, and James Worrell, editors, 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021, Glasgow, Scotland (Virtual Conference), July 12-16, 2021 , LIPIcs, pages 101:1--101:15. Schloss Dagstuhl - Leibniz-Zentru...
-
[54]
Practical Entropy-Compressed Rank/Select Dictionary
Daisuke Okanohara and Kunihiko Sadakane. Practical Entropy-Compressed Rank/Select Dictionary . In Proceedings of the Nine Workshop on Algorithm Engineering and Experiments, ALENEX 2007, New Orleans, Louisiana, USA, January 6, 2007 . SIAM , 2007. https://doi.org/10.1137/1.9781611972870.6 doi:10.1137/1.9781611972870.6
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.