Parse indexing for discarding short pseudo-MEMs safely

Travis Gagie

arxiv: 2605.17574 · v3 · pith:UAS47XOBnew · submitted 2026-05-17 · 💻 cs.DS

Pith reviewed 2026-05-20 12:46 UTC · model grok-4.3

keywords: parse indexing · pseudo-MEMs · maximal exact matches · KeBaB · repetitive text · filtering · Bloom filter

The pith

Parse indexing lets us safely filter pseudo-MEMs while still guaranteeing all longest exact matches are retained.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how parse indexing on a repetitive text supplies the extra information needed to pick a subset of pseudo-MEMs that is guaranteed to contain every longest MEM even after length or count filtering. This removes the risk that earlier k-mer-based breaking methods run when they discard short or low-ranked pseudo-MEMs, and it also removes the need to choose any fixed k in advance. A reader would care because the technique makes fast, reliable MEM search practical on large repetitive collections without introducing new tuning parameters or missing critical long matches.

Core claim

By building a parse index of the text, we can determine, for any set of candidate pseudo-MEMs produced by a Bloom filter on k-mers, exactly which ones must be kept so that the retained pseudo-MEMs are guaranteed to include every MEM whose length is maximal after any subsequent length threshold or top-t selection.

What carries the argument

The parse index, which records the run-length encoding of the text's LZ parse and thereby identifies which k-mer boundaries are critical for preserving longest matches under filtering.

If this is right

Filtering short or low-count pseudo-MEMs can now be performed without the possibility of discarding a longest MEM.
The same selection rule works for any choice of k, removing the need to tune that parameter.
Searches restricted to the retained pseudo-MEMs are guaranteed to report all longest matches while still benefiting from the speed-up of the Bloom-filter pre-filter.
The approach extends directly to any filtering rule based on length or rank that is applied after pseudo-MEM identification.
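One way to read the consequences above is as a simple discarding rule. The sketch below assumes each pseudo-MEM arrives with a certified lower bound on the length of the longest MEM it contains, as the abstract describes. The `PseudoMEM` type, the `discard_safely` function, and the rule itself are illustrative assumptions, not the paper's stated procedure.

```python
from typing import List, NamedTuple


class PseudoMEM(NamedTuple):
    """Hypothetical record: a half-open interval [start, end) of the pattern."""
    start: int
    end: int
    lower_bound: int  # assumed certified: some MEM inside has at least this length


def discard_safely(cands: List[PseudoMEM]) -> List[PseudoMEM]:
    """Keep only pseudo-MEMs long enough to possibly hold a longest MEM.

    The largest lower bound is a floor on the longest MEM overall, and a
    pseudo-MEM cannot contain a MEM longer than itself, so anything shorter
    than that floor cannot contain a longest MEM.
    """
    if not cands:
        return []
    floor = max(c.lower_bound for c in cands)
    return [c for c in cands if c.end - c.start >= floor]


cands = [
    PseudoMEM(0, 10, 4),    # length 10
    PseudoMEM(20, 50, 12),  # length 30
    PseudoMEM(60, 75, 9),   # length 15
    PseudoMEM(80, 91, 3),   # length 11
]
kept = discard_safely(cands)
```

In this example the floor is 12, so the pseudo-MEMs of length 10 and 11 are dropped and the other two are kept. A top-t rule would need a further argument that is not attempted here.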

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same parse-index information could be reused to guide other string-matching or compression pipelines that need to retain longest exact matches under resource constraints.
Because the method is parameter-free with respect to k, it may simplify integration into larger bioinformatics pipelines that already index repetitive genomes.

Load-bearing premise

The parse index contains enough information to identify precisely which pseudo-MEMs must be kept to guarantee that every longest MEM survives length or count filtering.
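The premise can be given a more explicit shape. The following is a hedged sketch of why per-pseudo-MEM lower bounds would suffice for a length-based filter. The symbols $Q_i$, $b_i$, $B$ and $\ell^*$ are introduced here for illustration and do not come from the paper.

```latex
Let $Q_1,\dots,Q_m$ be the pseudo-MEMs of $P$, and assume
(i) every longest MEM of $P$ with respect to $T$ lies inside some $Q_i$, and
(ii) each $Q_i$ comes with a bound $b_i$ such that $Q_i$ contains a MEM of
length at least $b_i$.
Let $\ell^*$ be the length of a longest MEM and let $B = \max_i b_i$. By (ii),
\[
  \ell^* \;\ge\; B .
\]
If $|Q_j| < B$, then every MEM contained in $Q_j$ has length at most
\[
  |Q_j| \;<\; B \;\le\; \ell^* ,
\]
so $Q_j$ contains no longest MEM. By (i), discarding all such $Q_j$ leaves
every longest MEM inside a retained pseudo-MEM.
```

Under this reading, the premise reduces to whether the parse index can certify the bounds $b_i$ in assumption (ii).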

What would settle it

Construct a repetitive text, a pattern, and a filter (length threshold or top-t) such that the longest MEM in the pattern is longer than every match found inside the pseudo-MEMs selected by the parse-index method; the claim is false if this occurs.
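The test described above can be sketched as a brute-force check on small inputs. The helper names `longest_matches` and `is_counterexample` are hypothetical, and the quadratic search is only meant for toy strings, not for the indexed setting the paper targets.

```python
from typing import List, Tuple


def longest_matches(pattern: str, text: str) -> Tuple[int, List[int]]:
    """Return the longest match length and the pattern positions where it starts."""
    best, starts = 0, []
    for i in range(len(pattern)):
        length = 0
        # Extend the match starting at i for as long as it still occurs in text.
        while i + length < len(pattern) and pattern[i:i + length + 1] in text:
            length += 1
        if length > best:
            best, starts = length, [i]
        elif length == best and length > 0:
            starts.append(i)
    return best, starts


def is_counterexample(pattern: str, text: str,
                      retained: List[Tuple[int, int]]) -> bool:
    """True if some longest match is not inside any retained half-open interval."""
    best, starts = longest_matches(pattern, text)
    if best == 0:
        return False
    return any(
        not any(a <= s and s + best <= b for a, b in retained)
        for s in starts
    )


text = "abracadabra"
pattern = "xabrazcadab"
safe = is_counterexample(pattern, text, [(1, 5), (6, 11)])
unsafe = is_counterexample(pattern, text, [(1, 5)])
```

Here the longest match is `cadab` at pattern position 6, so keeping both intervals is safe, while keeping only `(1, 5)` would count as a counterexample.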

Figures

Figures reproduced from arXiv: 2605.17574 by Travis Gagie.

Original abstract

Brown et al. (2025) described a pre-processing step, called $k$-mer based breaking (KeBaB), that speeds up searching for long maximal exact matches (MEMs) between a pattern $P$ and an indexed repetitive text $T$. KeBaB produces a set of substrings of $P$ called pseudo-MEMs that often have total length much less than $|P|$ but are still guaranteed to contain all the MEMs of length at least a fixed parameter $k$. Brown et al. found that KeBaB can be particularly effective when we discard all but the longest pseudo-MEMs, but then we risk also discarding the longest MEMs! In this paper we show how we can use parse indexing to generate pseudo-MEMs together with lower bounds on the lengths of the longest MEMs they must contain, allowing us to discard short pseudo-MEMs safely.
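To make the KeBaB step concrete, here is a minimal sketch of breaking a pattern into pseudo-MEMs by k-mer membership. It assumes pseudo-MEMs are the maximal substrings of the pattern whose k-mers are all reported present, and it uses an exact Python set in place of the Bloom filter. The function names are illustrative and not taken from Brown et al.

```python
from typing import List, Set, Tuple


def kmers(s: str, k: int) -> Set[str]:
    """All length-k substrings of s (exact stand-in for a Bloom filter)."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def pseudo_mems(pattern: str, text_kmers: Set[str], k: int) -> List[Tuple[int, int]]:
    """Half-open intervals of maximal runs of pattern k-mers found in text_kmers."""
    out = []
    run_start = None
    n = len(pattern)
    for i in range(n - k + 1):
        if pattern[i:i + k] in text_kmers:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            # The last present k-mer started at i - 1 and ends at i - 1 + k.
            out.append((run_start, i - 1 + k))
            run_start = None
    if run_start is not None:
        out.append((run_start, n))
    return out


text = "abracadabra"
pattern = "xabrazcadab"
intervals = pseudo_mems(pattern, kmers(text, 3), 3)
pieces = [pattern[a:b] for a, b in intervals]
```

With k = 3 the pattern breaks into `abra` and `cadab`. Any match of length at least k has all of its k-mers in the text, so it falls inside one of these intervals. A real Bloom filter may also report absent k-mers as present, which can only lengthen the intervals.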

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Gagie shows parse indexing can pick filtered pseudo-MEMs safely without fixing k.

The key point here is that parse indexing gives a way to choose pseudo-MEMs that remain safe even after you filter them by length or number, and you skip choosing the k parameter altogether. This is a direct follow-up to the KeBaB method in Brown et al. 2025.

KeBaB uses a Bloom filter on k-mers to find candidate regions quickly, but then filtering those candidates risks dropping the actual longest MEMs. Gagie points out that a parse index already encodes the long repeats in a repetitive text, so you can use it to decide which pseudo-MEMs must be kept to cover the longest matches.

What works well is the guarantee: the selected subset still contains every longest MEM after filtering. The paper avoids introducing new parameters and rests on the existing data structure properties rather than fitting anything.

The soft spots are minor. The write-up is short and leans on familiarity with parse indexes and the prior KeBaB paper. It would benefit from a concrete small example showing the selection step, or a note on implementation cost. There is no experimental section, which is fine for a note like this but means the practical speedup is assumed from the earlier work.

This paper is for people already working on MEM finding in repetitive strings, probably in bioinformatics or compressed data structures. A reader who knows the KeBaB paper will see the value immediately as a fix for its main drawback. I would recommend sending it for peer review. The central claim looks solid and addresses a real limitation in the prior approach.
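For readers unfamiliar with the Bloom-filter step the letter mentions, a minimal sketch follows. The class, its parameters, and the choice of salted BLAKE2b hashing are assumptions made for illustration, not details taken from KeBaB.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter over strings: no false negatives, some false positives."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


text, k = "abracadabra", 3
bf = BloomFilter(num_bits=1024, num_hashes=3)
for i in range(len(text) - k + 1):
    bf.add(text[i:i + k])
```

Every k-mer of the text is reported present. A k-mer that is not in the text is usually reported absent but may occasionally be reported present, which is why the candidate regions are only pseudo-MEMs.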

Referee Report

2 major / 2 minor

Summary. The paper builds on Brown et al. (2025)'s KeBaB preprocessing, which uses a Bloom filter on k-mers to identify pseudo-MEMs that are guaranteed to contain all MEMs of length at least k. It shows that parse indexing supplies the additional information needed to select a subset of these pseudo-MEMs that remains guaranteed to contain every longest MEM even after length-based (L > k) or cardinality-based (top-t) filtering, while removing the need to choose any fixed k.

Significance. If the central guarantee holds, the method removes a practical risk in KeBaB filtering and eliminates a free parameter, yielding a more robust and easier-to-deploy accelerator for MEM search on repetitive texts. This would be useful in large-scale string processing tasks such as genome alignment where both speed and completeness of long matches matter.

major comments (2)

[§4] Main guarantee: the argument that the parse index identifies exactly which pseudo-MEMs must be retained to cover all longest MEMs after arbitrary filtering needs an explicit statement of the retained set (e.g., via a formal definition or pseudocode) and a proof that no longest MEM can be lost when the filter discards short or low-ranked pseudo-MEMs.
[§3.1] Parse-index construction: it is unclear whether the parse must be built on the full text or can be built on the same k-mer Bloom-filtered view; if the former, the claimed advantage of avoiding k must be reconciled with the cost of the parse.

minor comments (2)

[Abstract] The abstract and introduction should state the precise filtering operations (length threshold vs. top-t) for which the guarantee is proved.
[§2] Notation for pseudo-MEM boundaries and how they relate to parse phrases should be introduced earlier and used consistently.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. The comments help clarify the presentation of the main guarantee and the construction details. We respond to each major comment below.

Referee: [§4] Main guarantee: the argument that the parse index identifies exactly which pseudo-MEMs must be retained to cover all longest MEMs after arbitrary filtering needs an explicit statement of the retained set (e.g., via a formal definition or pseudocode) and a proof that no longest MEM can be lost when the filter discards short or low-ranked pseudo-MEMs.

Authors: We agree that the guarantee would benefit from greater formality. In the revised manuscript we will add, in Section 4, an explicit definition of the retained set (the pseudo-MEMs whose parse intervals intersect the longest possible match positions) together with a short proof that length-based (L > k) or cardinality-based (top-t) filtering cannot eliminate any longest MEM when the parse index is used for selection. This addition makes the argument self-contained without changing the underlying result.

Revision: yes

Referee: [§3.1] Parse-index construction: it is unclear whether the parse must be built on the full text or can be built on the same k-mer Bloom-filtered view; if the former, the claimed advantage of avoiding k must be reconciled with the cost of the parse.

Authors: The parse index is constructed on the full text, because the hierarchical parse captures the repetition structure required for k-independent selection. The Bloom filter is applied only at query time to identify candidate k-mers and is independent of the parse. The advantage of avoiding a fixed k therefore holds at filtering time: once the (offline) parse is available, safe retention decisions no longer require choosing or knowing k. We will insert a clarifying paragraph in §3.1 that distinguishes the offline construction from the online, parameter-free filtering step and notes that the parse cost is incurred only once, comparable to other static indexes.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper's central claim is that parse indexing provides sufficient structural information to select a filtered subset of pseudo-MEMs guaranteed to contain all longest MEMs, while removing the need to choose k. This rests on the external KeBaB construction from Brown et al. (2025) combined with the independent properties of parse indexes as a data structure for repetitive texts. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation; the guarantee follows from the data-structure invariants rather than reducing to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The proposal rests on the unstated property that parse indexing can safely guide filtering decisions for pseudo-MEMs; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

Domain assumption: Parse indexing provides sufficient information to guarantee retention of pseudo-MEMs containing all longest MEMs after filtering.
This premise is required for the safety claim but is not justified in the provided abstract.
