Recognition: no theorem link
Compressed Index with Construction in Compressed Space
Pith reviewed 2026-05-15 22:23 UTC · model grok-4.3
The pith
A string index of size O(δ log n/δ) supports fast searches and builds itself in the same space via one streaming pass.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The index on s with O(δ log n/δ) space can search in s any string of length m in O(m + (occ + 1) log^ε n) time. The index can be built in O(n log n) expected time by one left-to-right pass on the string s in a streaming fashion with O(δ log n/δ) construction space. This is the first index that can be constructed within such space and with such time guarantees.
What carries the argument
Streaming left-to-right construction of the compressed index that keeps total space at O(δ log n/δ) throughout the single pass, using hash tables or deterministic dictionaries and avoiding Karp-Rabin fingerprints.
If this is right
- Search times match the currently best known bounds for compressed indexes.
- Space usage is nearly optimal and equals the known lower bound O(δ log n/(δ α)) whenever δ = O(n / α²).
- Construction processes the entire string in one left-to-right pass with expected O(n log n) time.
- The construction can be made fully deterministic by replacing hash tables with deterministic dictionaries, at the cost of a slowdown.
Where Pith is reading between the lines
- The same streaming technique could reduce peak memory for other string data structures whose construction currently requires uncompressed working space.
- Indexing very long strings on memory-constrained hardware becomes feasible when δ remains small relative to n.
- The method suggests possible extensions to dynamic or updateable indexes that also stay within compressed space bounds.
Load-bearing premise
The string complexity δ is known in advance and satisfies δ at least on the order of log log n.
What would settle it
An input string s with known δ where the construction space exceeds O(δ log n/δ) at any moment during the single pass or where a search for some pattern exceeds O(m + (occ + 1) log^ε n) time.
read the original abstract
Suppose that we are given a string $s$ of length $n$ over an alphabet $\{0,1,\ldots,n^{O(1)}\}$ and $\delta$ is the string complexity of $s$, a known compression measure. We describe an index on $s$ with $O(\delta\log\frac{n}{\delta})$ space, measured in $O(\log n)$-bit machine words, which can search in $s$ any string of length $m$ in $O(m + (\mathrm{occ} + 1)\log^\epsilon n)$ time, where $\mathrm{occ}$ is the number of occurrences and $\epsilon > 0$ is any fixed constant (the big-O in the space bound hides factor $\frac{1}{\epsilon}$). Crucially, the index can be built in $O(n\log n)$ expected time by one left-to-right pass on the string $s$ in a streaming fashion with $O(\delta\log\frac{n}{\delta})$ construction space. The index does not use the Karp--Rabin fingerprints, and the randomization in the construction time can be eliminated by using deterministic dictionaries instead of hash tables (with a slowdown). The search time matches currently best results and the space is almost optimal (the known optimum is $O(\delta\log\frac{n}{\delta\alpha})$, where $\alpha = \log_\sigma n$ and $\sigma$ is the alphabet size, and it coincides with $O(\delta\log\frac{n}{\delta})$ when $\delta = O(n / \alpha^2)$). This is the first index that can be constructed within such space and with such time guarantees. To avoid uninteresting marginal cases, all above bounds are stated for $\delta \ge \Omega(\log\log n)$.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a compressed index for a string s of length n over a polynomial-sized alphabet, where δ denotes the string complexity of s (assumed known and at least Ω(log log n)). The index uses O(δ log(n/δ)) words of space, supports pattern matching queries for a string of length m in O(m + (occ + 1) log^ε n) time for any fixed ε > 0, and is constructed via a single left-to-right streaming pass in O(n log n) expected time using only O(δ log(n/δ)) space. The construction avoids Karp-Rabin fingerprints and can be made deterministic at the cost of a slowdown. The space is claimed to be nearly optimal, matching the known lower bound when δ = O(n / α²) for α = log_σ n.
Significance. If the stated bounds and streaming construction hold, the result is significant: it is the first index achieving simultaneous near-optimal compressed space, optimal query time, and true one-pass streaming construction in compressed space. The explicit asymptotic guarantees without additional free parameters (beyond the supplied δ and ε) and the avoidance of fingerprints strengthen the contribution. The work provides a concrete algorithmic advance in the area of compressed string indexes.
major comments (1)
- [Abstract] Abstract and construction description: the requirement that δ is known in advance is load-bearing for the one-pass streaming claim. All sampling rates, block sizes, and space allocations are functions of the supplied δ; an incorrect value would immediately violate the O(δ log(n/δ)) space bound or invalidate the query guarantees. The manuscript should explicitly state whether δ can be computed or approximated within the same streaming pass and space bound, or acknowledge this as a prerequisite limitation.
minor comments (2)
- [Abstract] The space bound hides a 1/ε factor; the construction section should clarify how the hidden constant depends on ε and whether it affects the streaming space allocation.
- [Abstract] Notation: define occ and ε at first use and ensure consistent use of big-O notation across the query and construction bounds.
Simulated Author's Rebuttal
We thank the referee for their careful reading and positive recommendation of our manuscript. We address the major comment point by point below.
read point-by-point responses
-
Referee: [Abstract] Abstract and construction description: the requirement that δ is known in advance is load-bearing for the one-pass streaming claim. All sampling rates, block sizes, and space allocations are functions of the supplied δ; an incorrect value would immediately violate the O(δ log(n/δ)) space bound or invalidate the query guarantees. The manuscript should explicitly state whether δ can be computed or approximated within the same streaming pass and space bound, or acknowledge this as a prerequisite limitation.
Authors: We agree with the referee that the construction requires δ to be known in advance, as the algorithm's parameters are set based on this value. The paper presents δ as a known input measure, and our claims are conditional on this. Determining δ (or a suitable approximation) in a single streaming pass within the same space bound is not addressed in this work and remains an open problem. In the revised version, we will update the abstract and the construction description to explicitly acknowledge that δ is a prerequisite input to the algorithm, and that the streaming construction assumes it is supplied. revision: yes
Circularity Check
No circularity: explicit algorithmic construction with stated assumptions
full rationale
The paper states an explicit one-pass streaming construction whose space O(δ log n/δ) and query time O(m + (occ+1) log^ε n) are expressed directly in terms of the externally supplied parameters n and δ (with δ ≥ Ω(log log n) given as a precondition). No equation or claim reduces a derived quantity to a fitted input, no self-citation is invoked as a uniqueness theorem, and no ansatz is smuggled via prior work. The requirement that δ be known in advance is an explicit modeling assumption that enables parameter setting; it does not create a definitional loop or force the claimed bounds by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math Alphabet size is n^{O(1)} and machine words are O(log n) bits.
- domain assumption Deterministic dictionaries can replace hash tables to remove randomization.
Forward citations
Cited by 1 Pith paper
-
Dynamic Grammar-Compressed Self-Index in $\delta$-Optimal Space
The dynamic RR-index is the first dynamic self-index to attain δ-optimal space, with locate in expected O(m + log m log² n + occ (log n / log log n)) time and updates in O(m' log² n + log³ n) time.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.