arxiv: 2605.04258 · v1 · submitted 2026-05-05 · 💻 cs.DS

Recognition: unknown

Constructing Suffixient Arrays Revisited

Paola Bonizzoni , Younan Gao , Brian Riccardi

Authors on Pith no claims yet

Pith reviewed 2026-05-08 17:23 UTC · model grok-4.3

classification 💻 cs.DS

keywords suffixient arraysuffix arrayLCP arrayBurrows-Wheeler transformlinear time constructionone-pass algorithmtext indexingmaximal exact matches

0 comments

The pith

A new one-pass algorithm builds the suffixient array in linear time under the RAM model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an algorithm that constructs the suffixient array by scanning the suffix array, LCP array, and Burrows-Wheeler transform of the reversed text exactly once. This index is a compact subset of the full suffix array that still lets a user locate one occurrence of a pattern or report its maximal exact matches when random access to the original text is allowed. Earlier methods either needed several passes over the same arrays to reach linear time or paid an extra logarithmic factor when restricted to a single pass. The new procedure removes that overhead while keeping the total construction cost strictly linear in the text length.

Core claim

Given the suffix array, LCP array, and BWT of the reversed text over alphabet size sigma, the algorithm produces a correct suffixient array after a single left-to-right traversal of these structures, using only constant-time operations per position and therefore O(n) time in the standard RAM model.

What carries the argument

A single-pass decision procedure that marks positions for inclusion in the suffixient array by comparing LCP values against run boundaries in the reverse BWT and maintaining a small amount of local state.

If this is right

The suffixient array becomes available after one sequential read of the auxiliary arrays instead of repeated traversals.
Construction time remains linear even when the algorithm is forced to process its input in a streaming fashion.
The same index still supports single-occurrence pattern location and maximal exact match enumeration with random access to the text.
No extra logarithmic dependence on alphabet size appears in the one-pass setting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The technique may reduce cache traffic when building indexes for large repetitive collections such as genomes.
Similar one-pass logic could be applied to other subset indices that currently require multiple passes.
It indicates that the extra log factor in the earlier single-pass method was an artifact of the algorithm rather than a lower bound.

Load-bearing premise

The suffix array, LCP array, and BWT of the reversed text are already computed and can be read sequentially once.

What would settle it

An input text on which the output set of suffix ranks either misses a required position for correct MEM reporting or includes an unnecessary position, or on which the measured running time exceeds O(n) in the worst case.

read the original abstract

Recently, Cenzato et al.\ proposed a new text index, called the \emph{suffixient array}, which is a subset of the suffix array and supports locating a single pattern occurrence or finding its maximal exact matches (MEMs), assuming random access to the input text $T[1..n]$ is available. They show that, given the suffix array, the longest common prefix array, and the Burrows--Wheeler transform (BWT) of the reverse of $T[1..n]$ over an alphabet $\{1,\ldots,\sigma\}$, a suffixient array can be constructed in linear time. However, their construction algorithms require multiple scans of these arrays. When restricted to a single pass over the arrays, they present an alternative construction algorithm running in $O(n + \overline{r} \log \sigma)$ time, where $\overline{r}$ is the number of runs in the BWT of the reversed text. In this paper, we present a new one-pass algorithm that constructs a suffixient array in linear time under the standard RAM model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a clean one-pass linear-time construction for suffixient arrays, improving the prior O(n + r log sigma) bound.

read the letter

The main thing to know is that they replace the earlier one-pass construction with a true O(n) algorithm under the standard RAM model. The new method does a single scan over the SA, LCP, and BWT of the reversed text and uses LCP values to pick positions without the log sigma cost from handling BWT runs. That is the actual advance over Cenzato et al. and it is a real algorithmic step rather than a minor tweak. The construction is presented as a direct procedure with no hidden multi-pass behavior or parameter fitting, which keeps the claim straightforward. The paper does well at making the one-pass version practical while preserving linearity, and the RAM-model assumption is standard for this area. A soft spot is that the whole thing still requires the SA, LCP, and BWT of the reversed text to be available first; those structures are not free to build, though the paper limits its claim to the construction step itself. Another minor point is that full verification of constant-time per element operations would need the detailed pseudocode or proof, but nothing in the argument suggests extra logarithmic or quadratic costs. This is for people working on string indexes, suffix arrays, and text compression who care about construction time for compact structures like suffixient arrays. A reader already familiar with BWT and LCP arrays will see the value in the improved bound. It deserves a serious referee because the improvement is concrete, the claim is clear, and there is no load-bearing circularity or unstated assumption that breaks the linear-time result.

Referee Report

0 major / 2 minor

Summary. The paper presents a new one-pass algorithm for constructing the suffixient array in linear time under the standard RAM model. Given the suffix array, LCP array, and BWT of the reversed text, the algorithm performs a single scan that uses LCP values to select the required suffix positions, avoiding the O(n + r-bar log sigma) overhead of the prior one-pass method by Cenzato et al.

Significance. If the algorithm is correct, the result delivers an optimal linear-time, single-pass construction for the suffixient array, a compact subset of the suffix array that supports single-occurrence pattern location and MEM finding with random access to the text. This removes the logarithmic dependence on the number of BWT runs and is a clear practical and theoretical advance for large-scale text indexing. The paper's direct algorithmic procedure under the word-RAM model, with no free parameters or hidden costs, is a strength.

minor comments (2)

The manuscript would benefit from explicit pseudocode for the one-pass scan (currently described in prose) to make the constant-time per-element claim immediately verifiable.
A short complexity analysis paragraph immediately after the algorithm description would help readers confirm that all operations (LCP lookups, position selections) are O(1) under the standard RAM model with random access.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the paper and for recommending minor revision. The referee correctly identifies that the new algorithm achieves true linear time in a single pass over the input arrays, removing the dependence on the number of BWT runs that appeared in the prior one-pass method.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes a direct one-pass algorithmic procedure for constructing the suffixient array given the SA, LCP, and BWT of the reversed text under the standard RAM model. The claimed linear time bound follows from a single scan that uses LCP values for position selection, without any equations, parameters, or claims that reduce the result to a fitted input, self-definition, or load-bearing self-citation chain. The construction is presented as an independent procedural contribution that does not rely on renaming known results or smuggling ansatzes via prior work by the same authors.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on standard string-algorithm assumptions rather than introducing new fitted parameters or invented entities.

axioms (2)

domain assumption Standard RAM model with constant-time random access to arrays of size n
Invoked when claiming linear time for the one-pass scan.
domain assumption Suffix array, LCP array, and BWT of reversed text are given as input
Stated in the abstract as the starting point for construction.

pith-pipeline@v0.9.0 · 5480 in / 1216 out tokens · 34978 ms · 2026-05-08T17:23:58.334612+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

34 extracted references · 1 canonical work pages

[1]

15 Esko Ukkonen

doi:10.1089/CMB.2021.0290. 15 Esko Ukkonen. On-Line Construction of Suffix Trees.Algorithmica, 14(3):249–260, 1995. doi:10.1007/BF01206331. 16 Peter Weiner. Linear Pattern Matching Algorithms. In14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages1–11.IEEEComputer Society, 1973. doi:10.1109/SWAT.1973.13....

work page doi:10.1089/cmb.2021.0290 2021
[2]

is_start ←(pre_bwt̸=−1) and (curr_bwt̸= pre_bwt)
[3]

is_end ←(next_bwt̸=−1) and (curr_bwt̸= next_bwt)
[4]

return max(wstart,wend); B The Pseudocode for Computing b(i) Algorithm 2 Initial-Stack()
[5]

push (1,−1,−1) as (index,val,b _val) ontoS
[6]

return S Algorithm 3 Compute_b(S, LCP[i],i)
[7]

if tuple.index=i then
[8]

while tuple.val≥LCP[i] do
[9]

push (i, LCP[i],tuple.index) ontoS
[10]

17 Algorithm 4 Compute Suffixient Arrays (BWT[1..n], SA[1..n], LCP[1..n])

return tuple.index C The Pseudocode and an Example for Computing Suffixient Arrays Bonizzoni et al. 17 Algorithm 4 Compute Suffixient Arrays (BWT[1..n], SA[1..n], LCP[1..n])
[11]

BWT[n + 1]←LCP[n + 1]←−1
[12]

rowList←doubly-linked list
[13]

resulti←an empty list
[14]

if BWT[2]̸= BWT[1] then
[15]

append (n−SA[1] + 1, BWT[1], LCP[2]) as (ptext,char,weight ) to the front ofrowList
[16]

MAP[c]←head of rowList
[17]

ℓ←LCP[i]; ▷Step 1 starts
[19]

while ℓ<headTriple.weightdo
[22]

MAP[headTriple.char]←null
[23]

currWeight ←Compute_w(BWT[i−1], BWT[i], BWT[i + 1], LCP[i], LCP[i + 1])
[24]

curr_b←Compute_b(S, LCP[i],i ); ▷Step 1 ends
[25]

if currWeight> −1 then ▷Step 2 starts
[26]

if currWeight ̸= LCP[i] then
[27]

curr_b←Compute_b(S, LCP[i + 1],i + 1)
[28]

if i = Candtc(curr_b,i + 1) then
[29]

if MAP[c]̸=null then
[30]

remove triple from rowList
[31]

append (n−SA[i] + 1, BWT[i],currWeight ) to the front ofrowList
[32]

MAP[c]←head of rowList; ▷Step 2 ends
[33]

while rowList̸= null do ▷Final step starts
[34]

headTriple←rowList.head
[35]

append headTriple.ptext to resultheadTriple.char
[36]

remove the head entry from rowList
[37]

The table providesT rev [1..n], the LCP, BWT, and SA arrays, the variablecurr_b, the contents ofrowList, and the character-specific result listsResultc

return [result1, result2,..., resultσ]; ▷Final step ends 18 Constructing Suffixient Arrays Revisited Table 1 Execution trace of the algorithm computing a suffixient array for the inputT = AGCACAGCA$. The table providesT rev [1..n], the LCP, BWT, and SA arrays, the variablecurr_b, the contents ofrowList, and the character-specific result listsResultc. The ...