Pith · machine review for the scientific record

arxiv: 2605.08731 · v1 · submitted 2026-05-09 · 💻 cs.PF · cs.LG

Recognition: 2 theorem links
Single-Thread JPEG Decoder Benchmarks Mis-Evaluate ML Data Loaders


Pith reviewed 2026-05-12 01:59 UTC · model grok-4.3

classification 💻 cs.PF cs.LG
keywords JPEG decoder · PyTorch DataLoader · benchmark evaluation · single-thread vs multi-worker · data loading · CPU performance · ImageNet · ML training pipeline

The pith

Single-thread JPEG decoder benchmarks produce different rankings than measurements inside PyTorch DataLoaders.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

JPEG decode speed matters for machine learning training because data loading can limit how fast models learn. This paper tests whether single-thread microbenchmarks, which are common for choosing decoders, actually predict performance when the decoder runs as part of a PyTorch DataLoader with multiple workers. Across five CPU platforms and twelve decoder libraries, the relative ordering of decoders shifts depending on the measurement method. Decoders that look slow in isolation can perform well in the parallel DataLoader context, and the best choice for zero-skip workloads turns out to be torchvision or simplejpeg. The results also show that conclusions about the number of workers vary by CPU type.

Core claim

The central finding is that the choice of evaluation protocol alters which decoder appears fastest. Single-thread throughput rankings do not match DataLoader throughput rankings at worker counts of 0, 2, 4, and 8. Specific examples: imageio moves from ninth place to the top tier on Neoverse V2, and torchvision rises to the top on Zen 4 when measured inside the DataLoader. For PyTorch DataLoader use, torchvision achieves the highest mean normalized throughput, simplejpeg the highest minimum, and OpenCV stays above 90% of the platform-local best on every CPU.

What carries the argument

The protocol comparison between isolated single-thread decoding throughput and integrated PyTorch DataLoader throughput with controlled worker counts.
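The isolated side of that comparison can be sketched in a few lines. This is an illustrative harness, not the paper's released framework: `decode` stands in for any library call (e.g. `simplejpeg.decode_jpeg` or `cv2.imdecode`), and the byte blobs stand in for the in-memory ImageNet JPEGs the paper uses.

```python
import time

def single_thread_throughput(decode, blobs, repeats=3):
    """Isolated microbenchmark: one process, one thread.

    decode: callable taking raw JPEG bytes (stand-in for a real decoder).
    blobs:  list of in-memory JPEG byte strings (the paper decodes the
            full 50,000-image ImageNet validation split from memory).
    Returns the best observed rate in images/sec across repeats.
    """
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        for blob in blobs:
            decode(blob)
        best = max(best, len(blobs) / (time.perf_counter() - start))
    return best
```

The paper's point is precisely that this number, however carefully measured, need not predict throughput once the same decoder runs inside multi-worker DataLoader processes.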

If this is right

  • DataLoader measurements can elevate decoders that rank low in single-thread tests, such as imageio on Neoverse V2.
  • torchvision can improve from mid-tier single-thread to top DataLoader performance on certain CPUs like Zen 4.
  • Worker count recommendations for peak throughput differ between similar CPUs such as Zen 4 and Zen 5.
  • TensorFlow shows a pronounced single-thread slowdown on ARM CPUs compared to other platforms.
  • torchvision and simplejpeg form the strongest zero-skip tier for PyTorch DataLoader workloads.
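The ranking reversals listed above are easy to quantify once both protocols have been run. A small helper, shown here with synthetic illustrative numbers rather than the paper's measurements:

```python
def rank(throughputs):
    """Map decoder name -> rank (1 = fastest) from name -> images/sec."""
    ordered = sorted(throughputs, key=throughputs.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}

def rank_shifts(single_thread, dataloader):
    """Positive value: the decoder ranks better inside the DataLoader."""
    iso, integrated = rank(single_thread), rank(dataloader)
    return {name: iso[name] - integrated[name] for name in iso}

# Synthetic numbers for illustration only (images/sec):
iso = {"imageio": 350.0, "torchvision": 500.0, "opencv": 620.0}
dl = {"imageio": 900.0, "torchvision": 700.0, "opencv": 800.0}
shifts = rank_shifts(iso, dl)  # imageio improves by two places
```

Applied to the paper's released JSON, shifts of this kind are what separate the single-thread story from the DataLoader story.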

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Practitioners should re-evaluate decoder choices using their specific DataLoader setup rather than single-thread published numbers.
  • Decoder performance in loaders may depend on how each implementation handles memory access and thread coordination with the DataLoader.
  • Comparable evaluation gaps could exist for other common ML data operations like image augmentations or non-JPEG formats.
  • Adopting integrated benchmarks might improve training efficiency by avoiding suboptimal decoder selections on target hardware.

Load-bearing premise

The assumption that single-thread throughput is a reliable proxy for performance inside a multi-worker DataLoader without accounting for interactions with the loader's implementation or hardware specifics.

What would settle it

A follow-up test on a sixth CPU architecture where single-thread rankings match DataLoader rankings for all decoders, or the discovery of a decoder that maintains its relative position across both evaluation methods on the existing platforms.

Figures

Figures reproduced from arXiv: 2605.08731 by Vladimir Iglovikov.

Figure 1. Protocol changes decoder recommendations.
Figure 2. Worker-count scaling differs between AMD generations.
Figure 3. TensorFlow JPEG decode shows a large ARM penalty.
Figure 4. DataLoader speed and observed JPEG robustness.
Original abstract

JPEG decode is routine ML infrastructure, but Python decoder choices are often justified by single-process, single-thread microbenchmarks. We audit this evaluation assumption with twelve Python-accessible JPEG decode paths on five matched 16 vCPU Google Cloud CPUs: Intel Emerald Rapids, AMD Zen 4, AMD Zen 5, ARM Neoverse V2, and ARM Neoverse N1. ImageNet validation is the workload, not a new dataset contribution: each run decodes the full 50,000-image split from memory and reports single-thread throughput for all decoders, PyTorch DataLoader throughput for eligible decoders at worker counts {0,2,4,8}, and decoder skip behavior. The evaluation protocol changes the supported conclusion. On Neoverse V2, imageio is ninth in single-thread throughput yet lands in the top DataLoader tier with torchvision; on Zen 4, torchvision rises from seventh single-thread to the top measured DataLoader tier; on Neoverse N1, imagecodecs is the single-thread leader but fourth at peak DataLoader throughput. We also find that worker-count conclusions differ between Zen 4 and Zen 5, TensorFlow has a large single-thread ARM penalty, and strict libjpeg-turbo-family wrappers reject the same rare ImageNet JPEG. For PyTorch DataLoader workloads, torchvision and simplejpeg form the strongest measured zero-skip tier: torchvision has the highest mean normalized throughput, while simplejpeg has the highest minimum. OpenCV remains a robust general-purpose fallback above 90% of the platform-local winner on every tested CPU. We release raw JSON, generated tables/figures, and an executable local/cloud benchmark framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims that single-thread JPEG decoder microbenchmarks are misleading for ML data loader selection because performance rankings reverse under realistic PyTorch DataLoader workloads (workers in {0,2,4,8}) on the ImageNet validation set. It demonstrates this across five 16-vCPU platforms with twelve Python JPEG paths, reports concrete reversals (e.g., imageio ninth single-thread but top-tier DataLoader on Neoverse V2; torchvision seventh single-thread but top DataLoader on Zen 4), notes platform-specific behaviors including worker-count differences between Zen 4 and Zen 5, and releases raw JSON, tables, figures, and an executable benchmark framework.

Significance. If the empirical results hold, the work is significant for ML systems and performance evaluation: it supplies a workload-matched, reproducible protocol on a standard dataset rather than synthetic microbenchmarks, directly affecting decoder choice in training pipelines. The release of artifacts, explicit skip reporting, and cross-architecture coverage are strengths that enable verification and adoption of better evaluation practices.

minor comments (3)
  1. The abstract states that 'strict libjpeg-turbo-family wrappers reject the same rare ImageNet JPEG' and reports decoder skips; the paper should include an explicit table or subsection listing per-decoder skip counts and how skips are excluded from throughput calculations to ensure the DataLoader numbers are directly comparable to single-thread results.
  2. The methodology description should specify the exact PyTorch DataLoader parameters (prefetch_factor, pin_memory, persistent_workers) and timing method (e.g., wall-clock per batch or per image) used for the multi-worker measurements, as these choices can interact with decoder and memory behavior.
  3. Figure or table captions for the ranking reversals should include the normalized throughput values (or at least the top-three and bottom-three) so readers can assess the magnitude of the observed shifts without needing the raw JSON.
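On point 2, a minimal wall-clock harness makes the requested reporting concrete. The function below times any iterable of batches; with PyTorch it would wrap a `DataLoader` built with the parameters the referee asks to see disclosed. This is a sketch of the measurement the review requests, not the paper's actual harness:

```python
import time

def loader_throughput(loader, batch_size):
    """Wall-clock images/sec over one full pass of an iterable of batches.

    Assumes a fixed batch size; a short final batch slightly inflates the
    count, which a real harness should correct for.
    """
    batches = 0
    start = time.perf_counter()
    for _ in loader:
        batches += 1
    elapsed = time.perf_counter() - start
    return batches * batch_size / elapsed

# With PyTorch, the loader under test would be constructed explicitly, e.g.:
#   torch.utils.data.DataLoader(dataset, batch_size=64, num_workers=4,
#                               prefetch_factor=2, pin_memory=True,
#                               persistent_workers=True)
# so that every parameter the measurement depends on appears in the report.
```

Reporting the constructor arguments alongside the timing method removes the main ambiguity in comparing the multi-worker numbers to the single-thread ones.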

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment, recognition of the work's significance for ML systems evaluation, and recommendation for minor revision. The emphasis on reproducible, workload-matched protocols and artifact release is appreciated.

Circularity Check

0 steps flagged

No significant circularity in empirical benchmarking study

full rationale

This paper is a direct empirical benchmarking study that measures and compares JPEG decoder throughputs under single-thread microbenchmarks versus PyTorch DataLoader multi-worker conditions on five CPUs using the fixed ImageNet validation workload. All reported rankings, throughput numbers, and protocol-difference conclusions rest on concrete, reproducible runs with released raw JSON data and an executable framework; there are no equations, fitted parameters, derivations, or self-citation chains that reduce any result to a quantity defined by the paper's own inputs. The central claim that evaluation protocol changes supported conclusions is therefore independently verifiable from the measurements themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on direct throughput measurements of standard libraries against the public ImageNet validation set rather than on fitted models or new theoretical constructs.

axioms (1)
  • domain assumption: the ImageNet validation set serves as a representative workload for JPEG decoding in ML training.
    Explicitly used as the workload without introducing a new dataset.

pith-pipeline@v0.9.0 · 5602 in / 1335 out tokens · 84549 ms · 2026-05-12T01:59:52.120853+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

  1. Google Brain Team. TensorFlow. https://www.tensorflow.org, 2024. Accessed 2026-05-02.
  2. Google Cloud. Compute Engine machine families (c4 / c4d / c4a / t2a documentation). https://cloud.google.com/compute/docs/machine-types, 2026. Accessed 2026-05-02.
  3. Guillaume Leclerc, Andrew Ilyas, Logan Engstrom, Sung Min Park, Hadi Salman, and Aleksander Mądry. FFCV: Accelerating training by removing data bottlenecks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12011–12020, 2023.
  4. Bob McElrath and Thomas Breuel. WebDataset: a PyTorch dataset designed for streaming training. https://github.com/webdataset/webdataset, 2021. Accessed 2026-05-02.
  5. NVIDIA Corporation. NVIDIA DALI: GPU-accelerated data loading and image augmentation. https://developer.nvidia.com/dali, 2024. Accessed 2026-05-02.
  6. NVIDIA Corporation. nvJPEG: GPU-accelerated JPEG decode. https://developer.nvidia.com/nvjpeg, 2024. Accessed 2026-05-02.
  7. OpenCV Team. Open Source Computer Vision Library (OpenCV). https://opencv.org, 2024. Accessed 2026-05-02.
  8. Pillow Developers. Pillow: the friendly PIL fork. https://pillow.readthedocs.io/en/stable/, 2024. Accessed 2026-05-02.
  9. PyTorch Team. PyTorch. https://pytorch.org, 2024. Accessed 2026-05-02.
  10. PyTorch Team. TorchData. https://github.com/pytorch/data, 2024. Accessed 2026-05-02.
  11. PyTorch Team. torchvision. https://pytorch.org/vision, 2024. Accessed 2026-05-02.
  12. Edgar Riba, Dmytro Mishkin, Daniel Ponsa, Ethan Rublee, and Gary Bradski. Kornia: differentiable computer vision in PyTorch. https://kornia.github.io, 2024. Accessed 2026-05-02.
  13. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
  14. The libjpeg-turbo Project. libjpeg-turbo. https://libjpeg-turbo.org, 2024. Accessed 2026-05-02.