Compressed-Resident Genomics: Full-Pipeline Device-Resident GPU LZ77 Decode with Position-Invariant Random Access

Yakiv Shavidze

arxiv: 2606.18900 · v1 · pith:I6QQIS3Qnew · submitted 2026-06-17 · 💻 cs.DC

Pith reviewed 2026-06-26 19:20 UTC · model grok-4.3

classification 💻 cs.DC

keywords GPU decompression · LZ77 · genomics · random access · FASTQ · device-resident pipeline · ACEAPEX

The pith

A full device-resident GPU LZ77 pipeline decodes genomic data at 260 GB/s while supporting random access to individual reads in 0.362 ms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends the ACEAPEX absolute-offset parallel LZ77 codec so that its complete decompression pipeline runs on the GPU, with both entropy decoding and match resolution performed without host intervention. The pipeline produces bit-perfect output at up to 260 GB/s on FASTQ files. A compact coordinate index enables position-invariant random access, decoding an arbitrary read in 0.362 ms. For genomes too large to fit in VRAM, a range-decode strategy sustains 165.7 GB/s on 50 GB of data. The paper also reports that Meta's open DietGPU ANS codec decodes at 592 GB/s on an H100, faster than the proprietary entropy stage the pipeline currently uses.

Core claim

By extending ACEAPEX into a full device-resident GPU decode pipeline, entropy decoding and match resolution both stay on the device to reach 260 GB/s on FASTQ, a compact coordinate index supports position-invariant random access that decodes an arbitrary read in 0.362 ms, and a range-decode strategy decouples output size from VRAM to sustain 165.7 GB/s on a 50 GB genome, all while remaining bit-perfect.

What carries the argument

The device-resident GPU decode pipeline that performs entropy decoding and match resolution without host intervention, built on the absolute-offset parallel LZ77 codec ACEAPEX.
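
To make the role of absolute offsets concrete, here is a minimal Python sketch. The token layout (`("lit", dst, data)` and `("match", dst, src, length)`) and the function `decode_absolute` are hypothetical illustrations, not the ACEAPEX format or its GPU kernels, and the sketch assumes a match never overlaps its own destination. It only shows that when every token names absolute output positions, tokens can be applied in any order and still produce the same bytes.

```python
def decode_absolute(tokens, out_size):
    """Apply tokens in arrival order; the result does not depend on that order."""
    out = bytearray(out_size)
    ready = [False] * out_size      # marks output bytes that are already final
    pending = []                    # match tokens waiting for their source bytes

    # Literals carry their own bytes, so they can be placed immediately.
    for tok in tokens:
        if tok[0] == "lit":
            _, dst, data = tok
            out[dst:dst + len(data)] = data
            ready[dst:dst + len(data)] = [True] * len(data)
        else:
            pending.append(tok)

    # Resolve matches in waves: a match is copied once its whole source is final.
    while pending:
        deferred = []
        for tok in pending:
            _, dst, src, length = tok
            if all(ready[src:src + length]):
                out[dst:dst + length] = out[src:src + length]
                ready[dst:dst + length] = [True] * length
            else:
                deferred.append(tok)
        if len(deferred) == len(pending):
            raise ValueError("unresolvable match (self-overlap or cycle)")
        pending = deferred
    return bytes(out)


# Hypothetical token stream for the 10-byte output b"ACGTACGTAC".
TOKENS = [("lit", 0, b"ACGT"), ("match", 4, 0, 4), ("match", 8, 4, 2)]
```

Calling `decode_absolute(TOKENS, 10)` and `decode_absolute(list(reversed(TOKENS)), 10)` yields the same bytes, which is the property a parallel decoder relies on.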

If this is right

Full on-device processing removes host-device transfer overhead during genomic decompression.
Position-invariant random access allows direct extraction of individual reads without decompressing preceding data.
Range decoding enables processing of genomes larger than available VRAM at 165.7 GB/s.
The smaller read-to-block index reduces storage overhead compared with standard .fai files.
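
The read-to-block lookup behind this kind of random access can be sketched in a few lines. The index layout below (one first-read number and one compressed byte offset per block) and the names `BlockIndex` and `locate` are assumptions made for illustration; the abstract does not describe the paper's actual index format.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BlockIndex:
    first_read: List[int]    # ascending: number of the first read stored in each block
    byte_offset: List[int]   # compressed-stream offset at which each block starts

    def locate(self, read_id: int) -> Tuple[int, int, int]:
        """Return (block number, block byte offset, read position inside the block)."""
        if not self.first_read or read_id < self.first_read[0]:
            raise KeyError(read_id)
        # Last block whose first read number is <= read_id.
        block = bisect_right(self.first_read, read_id) - 1
        return block, self.byte_offset[block], read_id - self.first_read[block]


# Hypothetical three-block archive.
index = BlockIndex(first_read=[0, 1000, 2500], byte_offset=[0, 40_000, 97_500])
```

With an index of this shape, `index.locate(1700)` points at the second block only, so a decoder would need to touch one block regardless of where the read sits in the file.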

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same on-device pipeline structure could apply to other LZ77-based formats that currently force full sequential decompression.
Combining the pipeline with the faster open DietGPU entropy stage would create an entirely open high-throughput stack for compressed genomics.
Random-access performance at sub-millisecond latency per read could support interactive queries over petabyte-scale archives without first materializing decompressed copies.

Load-bearing premise

The ACEAPEX LZ77 codec can be extended to a complete on-device pipeline while preserving bit-perfect output and the claimed speeds without any hidden host-device transfers or post-processing steps.

What would settle it

Reproduce the reported 260 GB/s throughput and 0.362 ms random-read latency on a different GPU, while confirming that no CPU involvement occurs between the entropy and match stages.

Original abstract

Genomic archives grow faster than decompression keeps up: the European Nucleotide Archive holds tens of petabytes of fastq.gz, and gzip is fundamentally sequential. GPU decompressors (nvCOMP DEFLATE at ~50GB/s on A100) decode whole files with no random access; CPU genomic tools (CRAM, samtools) support region seeks but only at CPU speed. We extend ACEAPEX, an absolute-offset parallel LZ77 codec included in the official lzbench 2.3 release, with three contributions absent from our prior work. First, a full device-resident GPU decode pipeline (entropy and match resolution both on-device) reaching up to 260GB/s on FASTQ, closing the match-phase-only gap of the earlier paper. Second, position-invariant random access with a compact coordinate index: an arbitrary read decodes in 0.362ms, ~6x faster than warm samtools faidx, with a read-to-block index 6.3x smaller than a .fai. Third, a range-decode strategy that decouples output size from VRAM, sustaining 165.7GB/s on a 50GB genome where whole-file decode runs out of memory. All results are bit-perfect. We also measure Meta's open DietGPU ANS on H100 at 592GB/s decode, faster than the proprietary entropy stage we currently use, showing a fully open high-throughput stack is viable. Code is MIT-licensed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

The paper extends ACEAPEX to a claimed full on-device LZ77 pipeline with random access and range decode for genomics, but the abstract gives no pipeline details or verification steps so the 260 GB/s and device-resident claims stay uncheckable.

The new pieces are the full device-resident decode (entropy plus match resolution on GPU), the compact position-invariant coordinate index for single-read access, and the range-decode approach that avoids loading entire files into VRAM. These are presented as missing from the prior ACEAPEX work and from nvCOMP or samtools baselines. The reported numbers—260 GB/s on FASTQ, 0.362 ms per arbitrary read, 6.3x smaller index than .fai, and 165.7 GB/s on a 50 GB file—are concrete and the code is said to be MIT-licensed, which is useful for anyone who wants to try the implementation.

The main weakness is that everything rests on the abstract. There is no kernel sequence, no memory-residency proof, no timing breakdown between entropy and match stages, and no description of how bit-perfect output was checked or how the benchmarks were run. The stress-test point about possible hidden host transfers is therefore still open; if those transfers exist, the end-to-end throughput and random-access latency numbers would not hold as stated. The comparison to DietGPU on H100 is noted but not expanded.

This is practical systems work aimed at people who already handle petabyte-scale FASTQ.gz archives on GPUs and need both speed and random access. A reader who wants to reproduce or extend the codec could get value from the open code once the experimental details are filled in.

It deserves peer review because the performance targets address a real sequential bottleneck and the index and range-decode ideas are straightforward to evaluate if the authors supply the missing methodology.
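
One way to picture the range-decode idea mentioned in the letter is a loop that decodes fixed-size output windows, so the output buffer never grows beyond one window regardless of the total decoded size. This is a host-side Python sketch under stated assumptions: `decode_range` is a hypothetical callback standing in for the GPU decoder, `plan_ranges` and `range_decode` are illustrative names, and the window size is arbitrary.

```python
def plan_ranges(total_bytes, window_bytes):
    """Split [0, total_bytes) into consecutive half-open windows."""
    if window_bytes <= 0:
        raise ValueError("window_bytes must be positive")
    return [(start, min(start + window_bytes, total_bytes))
            for start in range(0, total_bytes, window_bytes)]


def range_decode(decode_range, total_bytes, window_bytes, consume):
    """Decode window by window; return the largest buffer that was ever held."""
    peak = 0
    for start, end in plan_ranges(total_bytes, window_bytes):
        chunk = decode_range(start, end)   # only this window is materialized
        peak = max(peak, len(chunk))
        consume(start, chunk)              # hand off before decoding the next window
    return peak


# Stand-in "decoder": slices a reference buffer instead of decompressing anything.
REFERENCE = b"ACGT" * 1000
parts = []
peak = range_decode(lambda s, e: REFERENCE[s:e], len(REFERENCE), 1024,
                    lambda s, chunk: parts.append(chunk))
```

In this toy run the peak buffer is bounded by the 1024-byte window while the concatenated windows still equal the full reference, which is the property that lets output size exceed device memory.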

Referee Report

2 major / 1 minor

Summary. The paper extends ACEAPEX, an absolute-offset parallel LZ77 codec, with a full device-resident GPU decode pipeline (entropy decoding and match resolution both on-device) for FASTQ data. It reports up to 260 GB/s throughput, position-invariant random access decoding an arbitrary read in 0.362 ms with a read-to-block index 6.3x smaller than .fai, and a range-decode strategy sustaining 165.7 GB/s on a 50 GB genome without exceeding VRAM. All results are claimed bit-perfect; the work also benchmarks Meta's DietGPU ANS at 592 GB/s on H100 and releases code under MIT license.

Significance. If the device-resident claims and throughput numbers hold, the work would advance GPU-accelerated genomic decompression by closing the match-phase-only gap from prior ACEAPEX work, enabling random access and large-file handling without host transfers. The open MIT-licensed code is a clear strength supporting reproducibility and further development of open high-throughput stacks.

major comments (2)

[Abstract and pipeline description] The central claim of a 'full device-resident GPU decode pipeline' with both entropy and match resolution on-device is load-bearing for the 260 GB/s and 165.7 GB/s figures, yet no kernel-launch sequence, memory residency proof, or timing breakdown separating the two stages is supplied to confirm absence of cudaMemcpy or host post-processing.
[Results and evaluation sections] Concrete throughput, latency (0.362 ms), and index-size ratios are reported without measurement methodology, error bars, full hardware details, or bit-perfect verification procedure, preventing assessment of whether post-hoc tuning or selective reporting affects the numbers.

minor comments (1)

[Abstract] The proprietary entropy stage used for the main results is not named, while DietGPU is presented as an open alternative; adding this detail would clarify the comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the device-resident pipeline claim and the reported performance numbers require stronger supporting details for full substantiation and reproducibility. We address each major comment below and will revise the manuscript to incorporate the requested clarifications.

Referee: [Abstract and pipeline description] The central claim of a 'full device-resident GPU decode pipeline' with both entropy and match resolution on-device is load-bearing for the 260 GB/s and 165.7 GB/s figures, yet no kernel-launch sequence, memory residency proof, or timing breakdown separating the two stages is supplied to confirm absence of cudaMemcpy or host post-processing.

Authors: We agree that explicit documentation is needed to substantiate the full device-resident claim. While Section 3 of the manuscript describes the pipeline architecture, we will add a new subsection with the exact sequence of CUDA kernel launches (entropy decode followed by match resolution), VRAM residency proofs via allocation details, and a timing breakdown table separating the two stages. This will explicitly confirm the absence of cudaMemcpy or host post-processing during decode.

Revision: yes

Referee: [Results and evaluation sections] Concrete throughput, latency (0.362 ms), and index-size ratios are reported without measurement methodology, error bars, full hardware details, or bit-perfect verification procedure, preventing assessment of whether post-hoc tuning or selective reporting affects the numbers.

Authors: We acknowledge that the current presentation lacks sufficient methodological transparency. We will expand the Results and Evaluation sections to include: full hardware specifications and software environment, the measurement protocol (including run counts, warm-up procedures, and timing methods), error bars or standard deviations for all reported figures, and a detailed description of the bit-perfect verification procedure (byte-for-byte comparison against a reference CPU decoder). These additions will allow independent verification of the results.

Revision: yes
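
A measurement protocol of the kind promised in this response could look like the following minimal Python sketch. It is an illustration, not the authors' harness: `benchmark` is a hypothetical helper, `zlib.decompress` stands in for the decoder under test, and wall-clock timing via `time.perf_counter` is used for simplicity (GPU kernels would normally be timed with CUDA events instead).

```python
import statistics
import time
import zlib


def benchmark(decode, compressed, reference, warmup=2, runs=5):
    """Time `decode` over several runs and check every output byte-for-byte."""
    for _ in range(warmup):                 # warm-up runs are not timed
        decode(compressed)

    seconds = []
    verified = True
    for _ in range(runs):
        t0 = time.perf_counter()
        out = decode(compressed)
        seconds.append(time.perf_counter() - t0)
        verified = verified and (out == reference)   # bit-perfect check

    return {
        "runs": runs,
        "mean_s": statistics.mean(seconds),
        "stdev_s": statistics.stdev(seconds) if runs > 1 else 0.0,
        "verified": verified,
    }


# Stand-in data: a small synthetic FASTQ-like buffer, compressed with zlib.
reference = b"@read1\nACGTACGT\n+\nIIIIIIII\n" * 1000
report = benchmark(zlib.decompress, zlib.compress(reference), reference)
```

Reporting the run count, mean, and standard deviation alongside the verification flag is what would let a reader judge whether a headline figure is a typical run or a best case.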

Circularity Check

0 steps flagged

No circularity; performance claims are benchmark-driven with no self-referential derivations

The manuscript reports empirical throughput, latency, and memory figures from GPU kernel runs on FASTQ and genome data. No equations, ansatzes, fitted parameters, or uniqueness theorems appear. References to prior ACEAPEX work describe the baseline being extended rather than supplying load-bearing premises that the new results reduce to by construction. The device-resident pipeline, random-access index, and range-decode strategy are presented as implementation contributions whose validity is asserted via bit-perfect output and measured speeds, not via any definitional or self-citation reduction. This is the expected non-finding for a systems-performance paper whose central claims are externally falsifiable benchmark numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an engineering systems contribution that implements and extends an existing LZ77 codec on GPU hardware; it introduces no new free parameters, mathematical axioms beyond standard LZ77 format assumptions, or postulated entities.

axioms (1)

Domain assumption: LZ77 decompression (entropy decoding and match resolution) can be performed entirely on-device while remaining bit-perfect.
Central to the full-pipeline claim; invoked when stating the 260 GB/s result and the extension of ACEAPEX.

pith-pipeline@v0.9.1-grok · 5799 in / 1435 out tokens · 23051 ms · 2026-06-26T19:20:32.506221+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Unified Position-Invariant Random Access Through Two Compression Layers via Absolute-Offset Coordinates: A Bit-Perfect Device-Resident Proof
cs.DC 2026-06 unverdicted novelty 5.0

Absolute-offset design enables unified position-invariant random access through entropy and match compression layers with one coordinate and bit-perfect verification.

Reference graph

Works this paper leans on

7 extracted references · 2 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1] Y. Shavidze, “ACEAPEX: Parallel LZ77 Decoding via Encode-Time Absolute Offset Resolution,” arXiv:2606.04268, 2026.
[2] E. Sitaridi et al., “Massively-parallel lossless data decompression,” ICPP, 2016, pp. 242–247.
[3] M. Köhler, T. Bingmann, and P. Sanders, “Rapidgzip: Parallel decompression and seeking in gzip streams using cache prefetching,” HPDC, 2023, pp. 295–307.
[4] T. Lin et al., “Recoil: Parallel rANS decoding with decoder-adaptive scalability,” ICS, 2023.
[5] “SAGe: Storage-Aware Genomic data compression,” arXiv:2504.03732, 2025.
[6] Meta, “DietGPU: GPU-based lossless compression,” open-source, 2022.
[7] P. Skibiński, “lzbench: in-memory benchmark of open-source LZ77/LZSS/LZMA compressors,” Version 2.3, 2026. [Online]. Available: https://github.com/inikep/lzbench/releases/tag/v2.3
