Pith · machine review for the scientific record

arxiv: 2605.08731 · v1 · submitted 2026-05-09 · 💻 cs.PF · cs.LG

Recognition: 2 theorem links
Single-Thread JPEG Decoder Benchmarks Mis-Evaluate ML Data Loaders


Pith reviewed 2026-05-12 01:59 UTC · model grok-4.3

classification 💻 cs.PF cs.LG
keywords JPEG decoder · PyTorch DataLoader · benchmark evaluation · single-thread vs multi-worker · data loading · CPU performance · ImageNet · ML training pipeline

The pith

Single-thread JPEG decoder benchmarks produce different rankings than measurements inside PyTorch DataLoaders.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

JPEG decode speed matters for machine learning training because data loading can limit how fast models learn. This paper tests whether single-thread microbenchmarks, which are common for choosing decoders, actually predict performance when the decoder runs as part of a PyTorch DataLoader with multiple workers. Across five CPU platforms and twelve decoder libraries, the relative ordering of decoders shifts depending on the measurement method. Decoders that look slow in isolation can perform well in the parallel DataLoader context, and the best choice for zero-skip workloads turns out to be torchvision or simplejpeg. The results also show that conclusions about the number of workers vary by CPU type.

Core claim

The central finding is that the choice of evaluation protocol alters which decoder appears fastest. Single-thread throughput rankings do not match DataLoader throughput rankings at worker counts of 0, 2, 4, and 8. Specific examples: imageio moves from ninth place to the top tier on Neoverse V2, and torchvision rises to the top on Zen 4 when measured inside the DataLoader. For PyTorch DataLoader use, torchvision achieves the highest mean normalized throughput, simplejpeg the highest minimum, and OpenCV stays above 90% of the platform-local best on every CPU.

What carries the argument

The protocol comparison between isolated single-thread decoding throughput and integrated PyTorch DataLoader throughput with controlled worker counts.
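The isolated side of that comparison can be sketched in a few lines. This is an illustrative harness, not the paper's released framework: `decode` stands in for any library call (e.g. `simplejpeg.decode_jpeg` or `cv2.imdecode`), and the byte blobs stand in for the in-memory ImageNet JPEGs the paper uses.

```python
import time

def single_thread_throughput(decode, blobs, repeats=3):
    """Isolated microbenchmark: one process, one thread.

    decode: callable taking raw JPEG bytes (stand-in for a real decoder).
    blobs:  list of in-memory JPEG byte strings (the paper decodes the
            full 50,000-image ImageNet validation split from memory).
    Returns the best observed rate in images/sec across repeats.
    """
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        for blob in blobs:
            decode(blob)
        best = max(best, len(blobs) / (time.perf_counter() - start))
    return best
```

The paper's point is precisely that this number, however carefully measured, need not predict throughput once the same decoder runs inside multi-worker DataLoader processes.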

If this is right

  • DataLoader measurements can elevate decoders that rank low in single-thread tests, such as imageio on Neoverse V2.
  • torchvision can improve from mid-tier single-thread to top DataLoader performance on certain CPUs like Zen 4.
  • Worker count recommendations for peak throughput differ between similar CPUs such as Zen 4 and Zen 5.
  • TensorFlow shows a pronounced single-thread slowdown on ARM CPUs compared to other platforms.
  • torchvision and simplejpeg form the strongest zero-skip tier for PyTorch DataLoader workloads.
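The ranking reversals listed above are easy to quantify once both protocols have been run. A small helper, shown here with synthetic illustrative numbers rather than the paper's measurements:

```python
def rank(throughputs):
    """Map decoder name -> rank (1 = fastest) from name -> images/sec."""
    ordered = sorted(throughputs, key=throughputs.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}

def rank_shifts(single_thread, dataloader):
    """Positive value: the decoder ranks better inside the DataLoader."""
    iso, integrated = rank(single_thread), rank(dataloader)
    return {name: iso[name] - integrated[name] for name in iso}

# Synthetic numbers for illustration only (images/sec):
iso = {"imageio": 350.0, "torchvision": 500.0, "opencv": 620.0}
dl = {"imageio": 900.0, "torchvision": 700.0, "opencv": 800.0}
shifts = rank_shifts(iso, dl)  # imageio improves by two places
```

Applied to the paper's released JSON, shifts of this kind are what separate the single-thread story from the DataLoader story.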

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Practitioners should re-evaluate decoder choices using their specific DataLoader setup rather than single-thread published numbers.
  • Decoder performance in loaders may depend on how each implementation handles memory access and thread coordination with the DataLoader.
  • Comparable evaluation gaps could exist for other common ML data operations like image augmentations or non-JPEG formats.
  • Adopting integrated benchmarks might improve training efficiency by avoiding suboptimal decoder selections on target hardware.

Load-bearing premise

The assumption that single-thread throughput is a reliable proxy for performance inside a multi-worker DataLoader without accounting for interactions with the loader's implementation or hardware specifics.

What would settle it

A follow-up test on a sixth CPU architecture where single-thread rankings match DataLoader rankings for all decoders, or the discovery of a decoder that maintains its relative position across both evaluation methods on the existing platforms.

Figures

Figures reproduced from arXiv: 2605.08731 by Vladimir Iglovikov.

Figure 1. Protocol changes decoder recommendations.
Figure 2. Worker-count scaling differs between AMD generations.
Figure 3. TensorFlow JPEG decode shows a large ARM penalty.
Figure 4. DataLoader speed and observed JPEG robustness.
Original abstract

JPEG decode is routine ML infrastructure, but Python decoder choices are often justified by single-process, single-thread microbenchmarks. We audit this evaluation assumption with twelve Python-accessible JPEG decode paths on five matched 16 vCPU Google Cloud CPUs: Intel Emerald Rapids, AMD Zen 4, AMD Zen 5, ARM Neoverse V2, and ARM Neoverse N1. ImageNet validation is the workload, not a new dataset contribution: each run decodes the full 50,000-image split from memory and reports single-thread throughput for all decoders, PyTorch DataLoader throughput for eligible decoders at worker counts {0,2,4,8}, and decoder skip behavior. The evaluation protocol changes the supported conclusion. On Neoverse V2, imageio is ninth in single-thread throughput yet lands in the top DataLoader tier with torchvision; on Zen 4, torchvision rises from seventh single-thread to the top measured DataLoader tier; on Neoverse N1, imagecodecs is the single-thread leader but fourth at peak DataLoader throughput. We also find that worker-count conclusions differ between Zen 4 and Zen 5, TensorFlow has a large single-thread ARM penalty, and strict libjpeg-turbo-family wrappers reject the same rare ImageNet JPEG. For PyTorch DataLoader workloads, torchvision and simplejpeg form the strongest measured zero-skip tier: torchvision has the highest mean normalized throughput, while simplejpeg has the highest minimum. OpenCV remains a robust general-purpose fallback above 90% of the platform-local winner on every tested CPU. We release raw JSON, generated tables/figures, and an executable local/cloud benchmark framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims that single-thread JPEG decoder microbenchmarks are misleading for ML data loader selection because performance rankings reverse under realistic PyTorch DataLoader workloads (workers in {0,2,4,8}) on the ImageNet validation set. It demonstrates this across five 16-vCPU platforms with twelve Python JPEG paths, reports concrete reversals (e.g., imageio ninth single-thread but top-tier DataLoader on Neoverse V2; torchvision seventh single-thread but top DataLoader on Zen 4), notes platform-specific behaviors including worker-count differences between Zen 4 and Zen 5, and releases raw JSON, tables, figures, and an executable benchmark framework.

Significance. If the empirical results hold, the work is significant for ML systems and performance evaluation: it supplies a workload-matched, reproducible protocol on a standard dataset rather than synthetic microbenchmarks, directly affecting decoder choice in training pipelines. The release of artifacts, explicit skip reporting, and cross-architecture coverage are strengths that enable verification and adoption of better evaluation practices.

minor comments (3)
  1. The abstract states that 'strict libjpeg-turbo-family wrappers reject the same rare ImageNet JPEG' and reports decoder skips; the paper should include an explicit table or subsection listing per-decoder skip counts and how skips are excluded from throughput calculations to ensure the DataLoader numbers are directly comparable to single-thread results.
  2. The methodology description should specify the exact PyTorch DataLoader parameters (prefetch_factor, pin_memory, persistent_workers) and timing method (e.g., wall-clock per batch or per image) used for the multi-worker measurements, as these choices can interact with decoder and memory behavior.
  3. Figure or table captions for the ranking reversals should include the normalized throughput values (or at least the top-three and bottom-three) so readers can assess the magnitude of the observed shifts without needing the raw JSON.
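On point 2, a minimal wall-clock harness makes the requested reporting concrete. The function below times any iterable of batches; with PyTorch it would wrap a `DataLoader` built with the parameters the referee asks to see disclosed. This is a sketch of the measurement the review requests, not the paper's actual harness:

```python
import time

def loader_throughput(loader, batch_size):
    """Wall-clock images/sec over one full pass of an iterable of batches.

    Assumes a fixed batch size; a short final batch slightly inflates the
    count, which a real harness should correct for.
    """
    batches = 0
    start = time.perf_counter()
    for _ in loader:
        batches += 1
    elapsed = time.perf_counter() - start
    return batches * batch_size / elapsed

# With PyTorch, the loader under test would be constructed explicitly, e.g.:
#   torch.utils.data.DataLoader(dataset, batch_size=64, num_workers=4,
#                               prefetch_factor=2, pin_memory=True,
#                               persistent_workers=True)
# so that every parameter the measurement depends on appears in the report.
```

Reporting the constructor arguments alongside the timing method removes the main ambiguity in comparing the multi-worker numbers to the single-thread ones.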

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment, recognition of the work's significance for ML systems evaluation, and recommendation for minor revision. The emphasis on reproducible, workload-matched protocols and artifact release is appreciated.

Circularity Check

0 steps flagged

No significant circularity in empirical benchmarking study

full rationale

This paper is a direct empirical benchmarking study that measures and compares JPEG decoder throughputs under single-thread microbenchmarks versus PyTorch DataLoader multi-worker conditions on five CPUs using the fixed ImageNet validation workload. All reported rankings, throughput numbers, and protocol-difference conclusions rest on concrete, reproducible runs with released raw JSON data and an executable framework; there are no equations, fitted parameters, derivations, or self-citation chains that reduce any result to a quantity defined by the paper's own inputs. The central claim that evaluation protocol changes supported conclusions is therefore independently verifiable from the measurements themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on direct throughput measurements of standard libraries against the public ImageNet validation set rather than on fitted models or new theoretical constructs.

axioms (1)
  • domain assumption: the ImageNet validation set serves as a representative workload for JPEG decoding in ML training.
    Explicitly used as the workload without introducing a new dataset.

pith-pipeline@v0.9.0 · 5602 in / 1335 out tokens · 84549 ms · 2026-05-12T01:59:52.120853+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

  1. Google Brain Team. TensorFlow. https://www.tensorflow.org, 2024. Accessed 2026-05-02.
  2. Google Cloud. Compute Engine machine families (c4 / c4d / c4a / t2a documentation). https://cloud.google.com/compute/docs/machine-types, 2026. Accessed 2026-05-02.
  3. Guillaume Leclerc, Andrew Ilyas, Logan Engstrom, Sung Min Park, Hadi Salman, and Aleksander Mądry. FFCV: Accelerating training by removing data bottlenecks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12011–12020, 2023.
  4. Bob McElrath and Thomas Breuel. WebDataset: a PyTorch dataset designed for streaming training. https://github.com/webdataset/webdataset, 2021. Accessed 2026-05-02.
  5. NVIDIA Corporation. NVIDIA DALI: GPU-accelerated data loading and image augmentation. https://developer.nvidia.com/dali, 2024. Accessed 2026-05-02.
  6. NVIDIA Corporation. nvJPEG: GPU-accelerated JPEG decode. https://developer.nvidia.com/nvjpeg, 2024. Accessed 2026-05-02.
  7. OpenCV Team. Open Source Computer Vision Library (OpenCV). https://opencv.org, 2024. Accessed 2026-05-02.
  8. Pillow Developers. Pillow: the friendly PIL fork. https://pillow.readthedocs.io/en/stable/, 2024. Accessed 2026-05-02.
  9. PyTorch Team. PyTorch. https://pytorch.org, 2024. Accessed 2026-05-02.
  10. PyTorch Team. TorchData. https://github.com/pytorch/data, 2024. Accessed 2026-05-02.
  11. PyTorch Team. torchvision. https://pytorch.org/vision, 2024. Accessed 2026-05-02.
  12. Edgar Riba, Dmytro Mishkin, Daniel Ponsa, Ethan Rublee, and Gary Bradski. Kornia: differentiable computer vision in PyTorch. https://kornia.github.io, 2024. Accessed 2026-05-02.
  13. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
  14. The libjpeg-turbo Project. libjpeg-turbo. https://libjpeg-turbo.org, 2024. Accessed 2026-05-02.