arxiv: 2606.05389 · v1 · pith:J73RCBLAnew · submitted 2026-06-03 · 💻 cs.AI

Residual Modeling for High-Fidelity Learned Compression of Scientific Data

Pith reviewed 2026-06-28 05:52 UTC · model grok-4.3

classification 💻 cs.AI
keywords learned compression · scientific data compression · residual coding · high-fidelity compression · lossy compression · Lorenzo differencing · neural bias prediction · spatiotemporal data

The pith

Tailored residual coders preserve learned compression advantage when correction dominates at high fidelity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that learned compressors lose their edge on scientific spatiotemporal data once per-block residual corrections must meet very tight NRMSE targets from 10^-6 to 10^-4, because the correction stream then dominates total rate. It introduces a residual-centric approach that treats the learned error field as structurally distinct and codes it with pipelines built for that field rather than the original data. LBRC applies deterministic adaptive quantization followed by 3D Lorenzo differencing, zigzag mapping, bit-plane coding, and entropy coding. NGLR adds a causal neural predictor that supplies a normalized bias to reduce the entropy of the remaining integer residual while keeping decoding fully deterministic. On E3SM, JHTDB, and ERA5 data these coders deliver 30-60% better ratios than prior GAE methods and outperform SZ in the high-fidelity regime.

Core claim

When global residual correction becomes rate-dominant in high-fidelity learned compression, residual representations tailored to learned-compressor residuals can preserve the advantage of learned compression. LBRC is a deterministic, training-free pipeline that adaptively quantizes the learned residual to the target NRMSE and losslessly encodes the resulting integer residual using 3D Lorenzo differencing, zigzag mapping, bit-plane coding, and entropy coding. NGLR adds a causal neural predictor that outputs a normalized bias for an integer-rounded Lorenzo prediction in the same deterministic integer pipeline, reducing the entropy of the remaining residual code while preserving deterministic decoding.
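The first three LBRC stages named above (quantization to an error target, 3D Lorenzo differencing, zigzag mapping) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes NRMSE means RMSE divided by the value range of the block, replaces the paper's adaptive quantizer with a conservative fixed step, and omits bit-plane and entropy coding. All function names are hypothetical.

```python
def step_for_nrmse(target_nrmse, value_range):
    """Conservative step (assumption): rounding error is at most step/2 per
    sample, so RMSE <= step/2 = target_nrmse * value_range."""
    return 2.0 * target_nrmse * value_range

def quantize(residual, step):
    """Uniform scalar quantization of a 3D nested-list residual to integers."""
    return [[[round(v / step) for v in row] for row in plane] for plane in residual]

def lorenzo_predict(q, i, j, k):
    """3D Lorenzo prediction from the 7 causal neighbours (outside the block = 0)."""
    def at(a, b, c):
        return q[a][b][c] if a >= 0 and b >= 0 and c >= 0 else 0
    return (at(i - 1, j, k) + at(i, j - 1, k) + at(i, j, k - 1)
            - at(i - 1, j - 1, k) - at(i - 1, j, k - 1) - at(i, j - 1, k - 1)
            + at(i - 1, j - 1, k - 1))

def zigzag(n):
    """Map signed integers to non-negative ones: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    return 2 * n if n >= 0 else -2 * n - 1

def unzigzag(u):
    """Inverse of zigzag."""
    return u // 2 if u % 2 == 0 else -(u + 1) // 2

def encode_block(q):
    """Lorenzo-difference the integer block in raster order, then zigzag-map."""
    return [zigzag(q[i][j][k] - lorenzo_predict(q, i, j, k))
            for i in range(len(q))
            for j in range(len(q[0]))
            for k in range(len(q[0][0]))]

def decode_block(symbols, shape):
    """Exact inverse of encode_block: rebuild the integers in the same raster order."""
    ni, nj, nk = shape
    q = [[[0] * nk for _ in range(nj)] for _ in range(ni)]
    it = iter(symbols)
    for i in range(ni):
        for j in range(nj):
            for k in range(nk):
                # Only already-decoded neighbours are read, so this is causal.
                q[i][j][k] = unzigzag(next(it)) + lorenzo_predict(q, i, j, k)
    return q
```

Only the quantization step loses information in this sketch. Everything after it operates on integers and inverts exactly, which is the property the summary describes as deterministic, lossless coding of the integer residual.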

What carries the argument

LBRC and NGLR residual coders that adaptively quantize and losslessly encode learned residuals via 3D Lorenzo differencing plus optional causal neural bias prediction.

If this is right

  • Compression ratios improve 30-60% over GAE at block-level NRMSE targets from 10^-6 to 10^-4.
  • NGLR adds a further 10-40% improvement over LBRC while remaining deterministic.
  • The methods become competitive with or superior to SZ across the evaluated high-fidelity regime.
  • The rate advantage of learned compressors is retained even when per-block residual correction dominates the bitstream.
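Why a neural bias term can coexist with deterministic decoding can be seen in a one-dimensional toy. The sketch below is an illustration under stated assumptions, not the NGLR architecture: `toy_bias` is a hypothetical stand-in for the neural predictor, the 1D previous-sample predictor stands in for 3D Lorenzo, and `scale` is an assumed parameter that converts the normalized bias into integer units.

```python
import math

def toy_bias(prev, prev2):
    """Hypothetical stand-in for the neural predictor: a normalized bias in
    [-1, 1] computed only from already-decoded samples (causal context)."""
    return math.tanh(0.5 * (prev - prev2))

def predict(decoded, t, scale):
    """Integer prediction = 1D Lorenzo prediction + integer-rounded bias."""
    prev = decoded[t - 1] if t >= 1 else 0
    prev2 = decoded[t - 2] if t >= 2 else 0
    lorenzo = prev  # 1D Lorenzo predictor: the previous sample
    return lorenzo + round(scale * toy_bias(prev, prev2))

def biased_encode(q, scale):
    """Integer residual codes; the encoder sees the same context the decoder will."""
    return [q[t] - predict(q, t, scale) for t in range(len(q))]

def biased_decode(codes, scale):
    """Rebuild the integers sample by sample from the causal context."""
    out = []
    for t, code in enumerate(codes):
        out.append(code + predict(out, t, scale))
    return out
```

Because the encoder and decoder evaluate the same predictor on the same already-decoded integers, the round trip is exact no matter how accurate the predictor is. Predictor quality affects only the size of the codes. In practice this also assumes the predictor's floating-point output rounds identically on the encoding and decoding machines.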

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same residual modeling strategy could be ported to learned compressors other than GAE.
  • Jointly optimizing the main compressor and the NGLR predictor might produce additional rate savings.
  • The deterministic integer pipelines support reproducible scientific workflows that require exact decoding.
  • The structural-difference premise could be tested directly by comparing residual statistics before and after the learned stage.
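The last point could be approached with simple summary statistics. The sketch below computes two of them, dynamic range and lag-1 autocorrelation, for a 1D sequence. The choice of statistics and the function names are assumptions made for illustration, and any input fed to them here would be synthetic rather than taken from the paper.

```python
def dynamic_range(x):
    """Spread between the largest and smallest sample."""
    return max(x) - min(x)

def lag1_autocorr(x):
    """Lag-1 autocorrelation: near +1 for smooth signals, negative for
    sample-to-sample oscillation, 0.0 returned for a constant signal."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    if var == 0:
        return 0.0
    cov = sum((x[t] - mean) * (x[t + 1] - mean) for t in range(n - 1))
    return cov / var

def compare(field, reconstruction):
    """Statistics of the original field versus the residual field - reconstruction."""
    residual = [f - r for f, r in zip(field, reconstruction)]
    return {
        "field": (dynamic_range(field), lag1_autocorr(field)),
        "residual": (dynamic_range(residual), lag1_autocorr(residual)),
    }
```

Running `compare` on real blocks and their learned reconstructions would show whether the residual has a narrower range or a different correlation structure than the field, which is the kind of evidence the structural-difference premise needs.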

Load-bearing premise

The learned residual is structurally different from the original scientific field and therefore benefits from specialized coding pipelines such as 3D Lorenzo differencing or causal neural bias prediction.

What would settle it

If applying LBRC or NGLR to the residuals produced by GAE on the E3SM, JHTDB, and ERA5 datasets at NRMSE targets 10^-6 to 10^-4 yields no ratio improvement or underperforms standard residual correction, the claim that tailoring to learned residuals is essential would be falsified.

Figures

Figures reproduced from arXiv: 2606.05389 by Anand Rangarajan, Liangji Zhu, Sanjay Ranka.

Figure 1: Overview of the Lorenzo-Based Residual Coding (LBRC) pipeline. The base learned compressor produces …
Figure 2: Overview of Neural-Guided Lorenzo Residual Coding (NGLR). The Neural Lorenzo Bias Predictor fuses Guide features from the base reconstruction …
Figure 3: Compression ratio vs. achieved block-level NRMSE across the three datasets. Across datasets, NGLR gives the highest CR, GAE the lowest, and …

Original abstract

Lossy compression is essential for massive spatiotemporal data from scientific simulations. Learned compressors can achieve high compression ratios at moderate accuracy targets, but their aggregate reconstruction losses do not guarantee accuracy for each block. Existing Guaranteed Autoencoder (GAE) methods add a per-block residual correction by retaining SVD/PCA-style coefficients until the target is met. This works at moderate tolerances, but in the high-fidelity regime with block-level NRMSE from 10^-6 to 10^-4, the number of retained coefficients grows quickly and the correction stream dominates the total rate. We propose a residual-centric view: the learned residual is structurally different from the original scientific field and should be coded with a representation designed for that residual. We introduce two residual coders. LBRC is a deterministic, training-free pipeline that adaptively quantizes the learned residual to the target NRMSE and losslessly encodes the resulting integer residual using 3D Lorenzo differencing, zigzag mapping, bit-plane coding, and entropy coding. NGLR adds a causal neural predictor that outputs a normalized bias for an integer-rounded Lorenzo prediction in the same deterministic integer pipeline, reducing the entropy of the remaining residual code while preserving deterministic decoding. The predictor weights are serialized and counted in the bitstream. Across E3SM, JHTDB, and ERA5 at block-level NRMSE targets from 10^-6 to 10^-4, LBRC improves compression ratio over GAE by 30-60% and is broadly competitive with SZ. NGLR adds a further 10-40% over LBRC and outperforms SZ in the evaluated high-fidelity regime. These results show that residual representations tailored to learned-compressor residuals can preserve the advantage of learned compression when global residual correction becomes rate-dominant.
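The abstract's observation that the number of retained coefficients grows quickly at tight tolerances can be illustrated with a small sketch. It is not the GAE procedure itself: an orthonormal DCT-II basis stands in for a learned SVD/PCA basis, the target is expressed as an absolute RMSE per block, and the function names are hypothetical.

```python
import math

def dct_basis(n):
    """Orthonormal DCT-II basis; an assumed stand-in for a learned SVD/PCA basis."""
    basis = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        basis.append([scale * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n)])
    return basis

def coefficients_needed(residual, basis, rmse_target):
    """Count how many coefficients must be kept, largest first, before the
    block RMSE of the corrected reconstruction drops to rmse_target."""
    n = len(residual)
    coeffs = [sum(b[t] * residual[t] for t in range(n)) for b in basis]
    # With an orthonormal basis, the squared error left after keeping a subset
    # equals the energy of the dropped coefficients.
    budget = n * rmse_target ** 2
    dropped_energy = sum(c * c for c in coeffs)
    kept = 0
    for c in sorted(coeffs, key=abs, reverse=True):
        if dropped_energy <= budget:
            break
        dropped_energy -= c * c
        kept += 1
    return kept
```

Tightening `rmse_target` shrinks the energy budget quadratically, so a residual whose energy is spread across many basis vectors needs nearly all of its coefficients at high fidelity. That is the regime in which, according to the abstract, the correction stream dominates the total rate.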

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a residual-centric approach to high-fidelity learned compression of scientific spatiotemporal data. It argues that when per-block residual correction dominates the rate at tight NRMSE targets (10^{-6} to 10^{-4}), standard SVD/PCA-style correction in Guaranteed Autoencoders (GAE) becomes inefficient; instead, the learned residual should be coded with pipelines (LBRC: adaptive quantization + 3D Lorenzo differencing, zigzag, bit-plane, entropy coding; NGLR: causal neural bias predictor on top of the same integer pipeline) that exploit its structure. Experiments on E3SM, JHTDB, and ERA5 report 30-60% ratio gains for LBRC over GAE and a further 10-40% for NGLR, making it competitive with or better than SZ while preserving deterministic decoding.

Significance. If the central claim holds, the work demonstrates that learned compressors can retain their advantage in the high-fidelity regime by shifting from generic residual correction to residual-specific lossless coding; the deterministic, training-free character of LBRC and the explicit serialization of NGLR weights are concrete strengths that support reproducibility. The results are empirical rather than axiomatic, but the multi-dataset evaluation at stated NRMSE targets provides a falsifiable basis for the residual-modeling thesis.

major comments (2)
  1. [Abstract / Results] Abstract and results sections: the claim that 'the learned residual is structurally different from the original scientific field' and therefore benefits from LBRC/NGLR is load-bearing for the central thesis, yet the manuscript reports no control experiment that applies the identical LBRC (or NGLR) pipeline directly to the raw fields at the same per-block NRMSE targets; without this ablation the 30-60% gains over GAE could arise from the quantization-to-NRMSE step or from the integer pipeline being generally effective rather than from any property unique to autoencoder residuals.
  2. [Results] Results section (dataset and baseline descriptions): full configuration details for the GAE SVD/PCA correction (number of retained coefficients per block, exact NRMSE enforcement procedure) and for SZ are not provided, nor are error bars or per-dataset variance on the reported compression ratios; this weakens verification of the 'broadly competitive with SZ' and 'outperforms SZ' statements at the high-fidelity end.
minor comments (2)
  1. [Abstract] Notation: the abstract introduces LBRC and NGLR without an explicit forward reference to the section that defines the integer pipeline and the causal predictor; a short methods overview paragraph would improve readability.
  2. [Methods (NGLR)] The paper states that predictor weights are serialized and counted in the bitstream, but does not specify the exact quantization or encoding format used for those weights; this detail belongs in the NGLR description.
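The effect of counting predictor weights in the bitstream can be made concrete with a small accounting sketch. The split into latent, residual, and weight bytes is an assumed decomposition for illustration, and the function name is hypothetical. The paper's actual bitstream layout is not described in the abstract.

```python
def compression_ratio(raw_bytes, latent_bytes, residual_bytes, weight_bytes=0):
    """Compression ratio with every transmitted component in the denominator,
    including serialized predictor weights when a neural coder is used."""
    total = latent_bytes + residual_bytes + weight_bytes
    if total <= 0:
        raise ValueError("compressed size must be positive")
    return raw_bytes / total
```

Under this accounting a neural residual coder only wins if the bytes it saves in the residual stream exceed the bytes spent on its weights, which is why the encoding format of those weights matters for verifying the reported ratios.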

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below, indicating planned revisions to the manuscript where appropriate.

  1. Referee: [Abstract / Results] Abstract and results sections: the claim that 'the learned residual is structurally different from the original scientific field' and therefore benefits from LBRC/NGLR is load-bearing for the central thesis, yet the manuscript reports no control experiment that applies the identical LBRC (or NGLR) pipeline directly to the raw fields at the same per-block NRMSE targets; without this ablation the 30-60% gains over GAE could arise from the quantization-to-NRMSE step or from the integer pipeline being generally effective rather than from any property unique to autoencoder residuals.

    Authors: The core experimental comparison is between GAE (which applies SVD/PCA correction to the identical learned residuals) and LBRC/NGLR (which applies the proposed pipelines to those same residuals). The reported 30-60% gains therefore isolate the effect of the residual coder itself rather than the quantization step or the integer pipeline in isolation. The structural difference claim is supported by the observation that autoencoder residuals exhibit lower dynamic range and altered spatial correlations compared with the original fields, which the Lorenzo-based and neural-bias pipelines are designed to exploit. A control applying LBRC directly to raw fields would evaluate a different (non-learned) compressor and is not required to substantiate the advantage of residual-specific coding within the learned-compression pipeline. We will nevertheless revise the manuscript to make this distinction explicit and to add a short discussion of the residual statistics that motivate the design choices. revision: partial

  2. Referee: [Results] Results section (dataset and baseline descriptions): full configuration details for the GAE SVD/PCA correction (number of retained coefficients per block, exact NRMSE enforcement procedure) and for SZ are not provided, nor are error bars or per-dataset variance on the reported compression ratios; this weakens verification of the 'broadly competitive with SZ' and 'outperforms SZ' statements at the high-fidelity end.

    Authors: We agree that these details are necessary for full reproducibility. In the revised version we will add the precise GAE configuration (number of retained SVD coefficients per block and the exact per-block NRMSE enforcement algorithm) together with the SZ parameter settings used in the comparisons. Because the methods are deterministic and the reported figures are obtained from single runs on fixed data partitions, per-run variance is not available; we will instead tabulate the per-dataset compression ratios explicitly and note the deterministic character of LBRC and NGLR. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical methods with explicit pipelines

The paper advances two residual coders with deterministic decoding (LBRC, a training-free pipeline using 3D Lorenzo differencing, zigzag mapping, bit-plane coding, and entropy coding; and NGLR, which adds a causal neural bias predictor whose weights are serialized in the bitstream) and reports empirical compression ratios on E3SM, JHTDB, and ERA5 at block-level NRMSE targets from 10^-6 to 10^-4. No equations, uniqueness theorems, or predictions are shown that reduce by construction to fitted parameters or self-citations. The structural-difference premise is stated as a modeling choice whose validity is tested via direct rate comparisons rather than derived from prior author work. All load-bearing claims rest on reproducible integer pipelines and external baselines (GAE, SZ), satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that residuals from learned compressors possess exploitable structure distinct from the original field; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • Domain assumption: the learned residual is structurally different from the original scientific field.
    Explicit premise for designing specialized residual coders rather than reusing the original-field compressor.
