arxiv: 2606.18900 · v1 · pith:I6QQIS3Qnew · submitted 2026-06-17 · 💻 cs.DC

Compressed-Resident Genomics: Full-Pipeline Device-Resident GPU LZ77 Decode with Position-Invariant Random Access

Pith reviewed 2026-06-26 19:20 UTC · model grok-4.3

classification 💻 cs.DC
keywords GPU decompression · LZ77 · genomics · random access · FASTQ · device-resident pipeline · ACEAPEX

The pith

A full device-resident GPU LZ77 pipeline decodes genomic data at 260 GB/s while supporting random access to individual reads in 0.362 ms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends the ACEAPEX absolute-offset parallel LZ77 codec to run its complete decompression pipeline entirely on the GPU, performing both entropy decoding and match resolution without host intervention. This produces bit-perfect output at up to 260 GB/s on FASTQ files. A compact coordinate index enables position-invariant random access, decoding an arbitrary read in 0.362 ms. For genomes whose decoded output does not fit in VRAM, a range-decode strategy sustains 165.7 GB/s on a 50 GB genome. The paper also measures Meta's open DietGPU ANS codec at 592 GB/s decode on an H100, faster than the proprietary entropy stage it currently uses.

Core claim

By extending ACEAPEX into a full device-resident GPU decode pipeline, entropy decoding and match resolution both stay on the device to reach 260 GB/s on FASTQ, a compact coordinate index supports position-invariant random access that decodes an arbitrary read in 0.362 ms, and a range-decode strategy decouples output size from VRAM to sustain 165.7 GB/s on a 50 GB genome, all while remaining bit-perfect.

What carries the argument

The device-resident GPU decode pipeline that performs entropy decoding and match resolution without host intervention, built on the absolute-offset parallel LZ77 codec ACEAPEX.
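To make the "absolute-offset" idea concrete, here is a minimal toy sketch of match resolution when every token carries absolute output positions. The token layout (`"lit"` and `"match"` tuples) and the `decode` function are hypothetical illustrations, not ACEAPEX's actual format or kernels. The sketch assumes tokens are ordered by destination and that every match source precedes its destination.

```python
from typing import List, Tuple, Union

# Hypothetical token layout (an assumption, not ACEAPEX's real format):
#   ("lit", dst, data)      -> write `data` at absolute output position dst
#   ("match", dst, src, n)  -> copy n bytes from absolute position src to dst
Token = Union[Tuple[str, int, bytes], Tuple[str, int, int, int]]


def decode(tokens: List[Token], out_len: int) -> bytes:
    """Resolve a toy absolute-offset token stream into its output bytes."""
    out = bytearray(out_len)

    # Pass 1: literals depend on nothing, so each one could be written independently.
    for tok in tokens:
        if tok[0] == "lit":
            _, dst, data = tok
            out[dst:dst + len(data)] = data

    # Pass 2: matches read from absolute source positions. Tokens are assumed
    # to be ordered by dst with src < dst, so sources are ready when needed.
    for tok in tokens:
        if tok[0] == "match":
            _, dst, src, n = tok
            for i in range(n):  # byte-wise so overlapping copies (src + n > dst) work
                out[dst + i] = out[src + i]
    return bytes(out)


tokens: List[Token] = [
    ("lit", 0, b"ACGT"),
    ("match", 4, 0, 8),   # overlapping match: repeats "ACGT" twice
    ("lit", 12, b"TT"),
]
decoded = decode(tokens, 14)
```

Because source and destination positions are both fixed before decoding starts, a parallel decoder can in principle decide up front which matches are independent. This sequential loop only illustrates the data flow.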

If this is right

  • Full on-device processing removes host-device transfer overhead during genomic decompression.
  • Position-invariant random access allows direct extraction of individual reads without decompressing preceding data.
  • Range decoding enables processing of genomes larger than available VRAM at 165.7 GB/s.
  • The read-to-block index is 6.3x smaller than a standard .fai file, reducing index storage overhead.
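The random-access and range-decode points above can be sketched on the CPU with independently compressed blocks. This is a simplified illustration under stated assumptions: `zlib` stands in for the paper's codec, and `BLOCK`, `build`, `decode_range`, and `read_at` are hypothetical names. The index stores absolute record offsets, which may differ from the paper's read-to-block layout.

```python
import zlib
from typing import List, Tuple

BLOCK = 64  # uncompressed bytes per block (tiny, for illustration only)


def build(data: bytes) -> Tuple[List[bytes], List[int]]:
    """Compress fixed-size blocks independently and index FASTQ record starts."""
    blocks = [zlib.compress(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]
    starts, pos = [], 0
    for k, line in enumerate(data.splitlines(keepends=True)):
        if k % 4 == 0:          # a FASTQ record is assumed to span exactly 4 lines
            starts.append(pos)
        pos += len(line)
    starts.append(len(data))    # sentinel: end of the last record
    return blocks, starts


def decode_range(blocks: List[bytes], start: int, end: int) -> bytes:
    """Decode only the blocks overlapping [start, end) of the uncompressed data."""
    if start >= end:
        return b""
    first, last = start // BLOCK, (end - 1) // BLOCK
    buf = b"".join(zlib.decompress(blocks[b]) for b in range(first, last + 1))
    base = first * BLOCK        # absolute offset of the first decoded byte
    return buf[start - base:end - base]


def read_at(blocks: List[bytes], starts: List[int], i: int) -> bytes:
    """Return record i without decoding any block that does not contain it."""
    return decode_range(blocks, starts[i], starts[i + 1])


records = [f"@read{i}\n{'ACGT' * 5}\n+\n{'I' * 20}\n".encode() for i in range(10)]
data = b"".join(records)
blocks, starts = build(data)
seventh = read_at(blocks, starts, 7)
window = decode_range(blocks, 100, 300)
```

The working buffer in `decode_range` scales with the requested range rather than the whole file. That is the property the range-decode bullet relies on when output exceeds VRAM.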

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same on-device pipeline structure could apply to other LZ77-based formats that currently force full sequential decompression.
  • Combining the pipeline with the faster open DietGPU entropy stage would create an entirely open high-throughput stack for compressed genomics.
  • Random-access performance at sub-millisecond latency per read could support interactive queries over petabyte-scale archives without first materializing decompressed copies.

Load-bearing premise

The ACEAPEX LZ77 codec can be extended to a complete on-device pipeline while preserving bit-perfect output and the claimed speeds without any hidden host-device transfers or post-processing steps.

What would settle it

Measure the pipeline on a different GPU while confirming no CPU involvement occurs between entropy and match stages and that the reported 260 GB/s throughput and 0.362 ms random-read latency are reproduced.

Original abstract

Genomic archives grow faster than decompression keeps up: the European Nucleotide Archive holds tens of petabytes of fastq.gz, and gzip is fundamentally sequential. GPU decompressors (nvCOMP DEFLATE at ~50GB/s on A100) decode whole files with no random access; CPU genomic tools (CRAM, samtools) support region seeks but only at CPU speed. We extend ACEAPEX, an absolute-offset parallel LZ77 codec included in the official lzbench 2.3 release, with three contributions absent from our prior work. First, a full device-resident GPU decode pipeline (entropy and match resolution both on-device) reaching up to 260GB/s on FASTQ, closing the match-phase-only gap of the earlier paper. Second, position-invariant random access with a compact coordinate index: an arbitrary read decodes in 0.362ms, ~6x faster than warm samtools faidx, with a read-to-block index 6.3x smaller than a .fai. Third, a range-decode strategy that decouples output size from VRAM, sustaining 165.7GB/s on a 50GB genome where whole-file decode runs out of memory. All results are bit-perfect. We also measure Meta's open DietGPU ANS on H100 at 592GB/s decode, faster than the proprietary entropy stage we currently use, showing a fully open high-throughput stack is viable. Code is MIT-licensed.
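As a quick sanity check on the abstract's figures, the arithmetic below derives two implied quantities. It assumes decimal gigabytes (1 GB = 10^9 bytes) and treats the "~6x" speedup as exactly 6, so the second result is approximate. Neither derived number is reported by the paper itself.

```python
GB = 1e9  # assumption: decimal gigabytes

genome_bytes = 50 * GB
range_rate = 165.7 * GB            # bytes per second, from the abstract
range_decode_s = genome_bytes / range_rate

read_latency_ms = 0.362            # from the abstract
speedup_vs_faidx = 6.0             # the abstract says "~6x", so this is approximate
implied_faidx_ms = read_latency_ms * speedup_vs_faidx

print(f"50 GB at 165.7 GB/s: {range_decode_s:.3f} s")
print(f"implied warm samtools faidx latency: about {implied_faidx_ms:.1f} ms")
```

At the stated rate, a full pass over the 50 GB genome takes roughly 0.30 s, and the implied warm `samtools faidx` latency is on the order of 2 ms.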

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper extends ACEAPEX, an absolute-offset parallel LZ77 codec, with a full device-resident GPU decode pipeline (entropy decoding and match resolution both on-device) for FASTQ data. It reports up to 260 GB/s throughput, position-invariant random access decoding an arbitrary read in 0.362 ms with a read-to-block index 6.3x smaller than .fai, and a range-decode strategy sustaining 165.7 GB/s on a 50 GB genome without exceeding VRAM. All results are claimed bit-perfect; the work also benchmarks Meta's DietGPU ANS at 592 GB/s on H100 and releases code under MIT license.

Significance. If the device-resident claims and throughput numbers hold, the work would advance GPU-accelerated genomic decompression by closing the match-phase-only gap from prior ACEAPEX work, enabling random access and large-file handling without host transfers. The open MIT-licensed code is a clear strength supporting reproducibility and further development of open high-throughput stacks.

major comments (2)
  1. [Abstract and pipeline description] The central claim of a 'full device-resident GPU decode pipeline' with both entropy and match resolution on-device is load-bearing for the 260 GB/s and 165.7 GB/s figures, yet no kernel-launch sequence, memory residency proof, or timing breakdown separating the two stages is supplied to confirm absence of cudaMemcpy or host post-processing.
  2. [Results and evaluation sections] Concrete throughput, latency (0.362 ms), and index-size ratios are reported without measurement methodology, error bars, full hardware details, or bit-perfect verification procedure, preventing assessment of whether post-hoc tuning or selective reporting affects the numbers.
minor comments (1)
  1. [Abstract] The proprietary entropy stage used for the main results is not named, while DietGPU is presented as an open alternative; adding this detail would clarify the comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the device-resident pipeline claim and the reported performance numbers require stronger supporting details for full substantiation and reproducibility. We address each major comment below and will revise the manuscript to incorporate the requested clarifications.

Point-by-point responses
  1. Referee: [Abstract and pipeline description] The central claim of a 'full device-resident GPU decode pipeline' with both entropy and match resolution on-device is load-bearing for the 260 GB/s and 165.7 GB/s figures, yet no kernel-launch sequence, memory residency proof, or timing breakdown separating the two stages is supplied to confirm absence of cudaMemcpy or host post-processing.

    Authors: We agree that explicit documentation is needed to substantiate the full device-resident claim. While Section 3 of the manuscript describes the pipeline architecture, we will add a new subsection with the exact sequence of CUDA kernel launches (entropy decode followed by match resolution), VRAM residency proofs via allocation details, and a timing breakdown table separating the two stages. This will explicitly confirm the absence of cudaMemcpy or host post-processing during decode. revision: yes

  2. Referee: [Results and evaluation sections] Concrete throughput, latency (0.362 ms), and index-size ratios are reported without measurement methodology, error bars, full hardware details, or bit-perfect verification procedure, preventing assessment of whether post-hoc tuning or selective reporting affects the numbers.

    Authors: We acknowledge that the current presentation lacks sufficient methodological transparency. We will expand the Results and Evaluation sections to include: full hardware specifications and software environment, the measurement protocol (including run counts, warm-up procedures, and timing methods), error bars or standard deviations for all reported figures, and a detailed description of the bit-perfect verification procedure (byte-for-byte comparison against a reference CPU decoder). These additions will allow independent verification of the results. revision: yes
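    The protocol promised in this response (warm-up, repeated runs, standard deviation, byte-for-byte verification) could look roughly like the harness below. It is a host-side sketch only: `benchmark` is a hypothetical helper, and `zlib.decompress` stands in for the decoder under test. A real GPU measurement would also need to synchronize the device before reading the timer, since kernel launches are asynchronous.

```python
import statistics
import time
import zlib
from typing import Callable, Dict


def benchmark(decode: Callable[[bytes], bytes], compressed: bytes, reference: bytes,
              warmup: int = 3, runs: int = 10) -> Dict[str, float]:
    """Time `decode`, verify every output byte-for-byte, report mean/stdev in GB/s."""
    if runs < 2:
        raise ValueError("need at least 2 timed runs to compute a standard deviation")
    for _ in range(warmup):                 # untimed runs to warm caches
        decode(compressed)
    rates = []
    for _ in range(runs):
        t0 = time.perf_counter()
        out = decode(compressed)
        dt = time.perf_counter() - t0
        if out != reference:                # bit-perfect check: exact byte equality
            raise ValueError("decoded output differs from reference")
        rates.append(len(out) / dt / 1e9)   # GB/s, with 1 GB = 1e9 bytes
    return {"mean_gbps": statistics.mean(rates),
            "stdev_gbps": statistics.stdev(rates)}


reference = b"ACGT" * 250_000               # 1 MB of synthetic sequence data
compressed = zlib.compress(reference)
result = benchmark(zlib.decompress, compressed, reference, warmup=1, runs=5)
```

    Reporting the standard deviation alongside the mean, and failing hard on any byte mismatch, covers the two gaps the referee raises: error bars and a stated verification procedure.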

Circularity Check

0 steps flagged

No circularity; performance claims are benchmark-driven with no self-referential derivations

full rationale

The manuscript reports empirical throughput, latency, and memory figures from GPU kernel runs on FASTQ and genome data. No equations, ansatzes, fitted parameters, or uniqueness theorems appear. References to prior ACEAPEX work describe the baseline being extended rather than supplying load-bearing premises that the new results reduce to by construction. The device-resident pipeline, random-access index, and range-decode strategy are presented as implementation contributions whose validity is asserted via bit-perfect output and measured speeds, not via any definitional or self-citation reduction. This is the expected non-finding for a systems-performance paper whose central claims are externally falsifiable benchmark numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an engineering systems contribution that implements and extends an existing LZ77 codec on GPU hardware; it introduces no new free parameters, mathematical axioms beyond standard LZ77 format assumptions, or postulated entities.

axioms (1)
  • domain assumption: LZ77 decompression (entropy decoding and match resolution) can be performed entirely on-device while remaining bit-perfect.
    Central to the full-pipeline claim; invoked when stating the 260 GB/s result and the extension of ACEAPEX.

pith-pipeline@v0.9.1-grok · 5799 in / 1435 out tokens · 23051 ms · 2026-06-26T19:20:32.506221+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Unified Position-Invariant Random Access Through Two Compression Layers via Absolute-Offset Coordinates: A Bit-Perfect Device-Resident Proof

    cs.DC 2026-06 unverdicted novelty 5.0

    Absolute-offset design enables unified position-invariant random access through entropy and match compression layers with one coordinate and bit-perfect verification.

Reference graph

Works this paper leans on

7 extracted references · 2 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1] Y. Shavidze, “ACEAPEX: Parallel LZ77 Decoding via Encode-Time Absolute Offset Resolution,” arXiv:2606.04268, 2026.

  2. [2] E. Sitaridi et al., “Massively-parallel lossless data decompression,” ICPP, 2016, pp. 242–247.

  3. [3] M. Köhler, T. Bingmann, and P. Sanders, “Rapidgzip: Parallel decompression and seeking in gzip streams using cache prefetching,” HPDC, 2023, pp. 295–307.

  4. [4] T. Lin et al., “Recoil: Parallel rANS decoding with decoder-adaptive scalability,” ICS, 2023.

  5. [5] “SAGe: Storage-Aware Genomic data compression,” arXiv:2504.03732, 2025.

  6. [6] Meta, “DietGPU: GPU-based lossless compression,” open-source, 2022.

  7. [7] P. Skibiński, “lzbench: in-memory benchmark of open-source LZ77/LZSS/LZMA compressors,” Version 2.3, 2026. [Online]. Available: https://github.com/inikep/lzbench/releases/tag/v2.3