pith. machine review for the scientific record.

arxiv: 2604.14552 · v2 · submitted 2026-04-16 · 💻 cs.PF · cs.AR · cs.LG

Recognition: unknown

DEEP-GAP: Deep-learning Evaluation of Execution Parallelism in GPU Architectural Performance

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:56 UTC · model grok-4.3

classification 💻 cs.PF · cs.AR · cs.LG
keywords GPU inference performance · T4 L4 comparison · ResNet evaluation · precision modes · batch size optimization · throughput latency tradeoff · inference acceleration

The pith

NVIDIA L4 GPU achieves up to 4.4x higher inference throughput than T4, peaking at smaller batch sizes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes the performance gap between NVIDIA T4 and L4 GPUs for deep learning inference by running identical workloads under controlled conditions. It shows that lower precision modes deliver large gains and that the L4 reaches its best efficiency at batch sizes of 16 to 32, which helps latency-sensitive tasks, while the T4 stays competitive at large batches. A sympathetic reader would care because datacenters must choose hardware that balances speed, power, and rack space, and these direct comparisons give practical selection rules. The work applies a systematic evaluation method to quantify how architectural changes translate into real inference speedups.

Core claim

Using the DEEP-GAP evaluation on ResNet18, ResNet50, and ResNet101 models in FP32, FP16, and INT8 modes with PyTorch and TensorRT, the L4 GPU delivers up to 4.4x higher throughput than the T4 and reaches peak efficiency at batch sizes between 16 and 32. This improves latency-throughput tradeoffs for latency-sensitive workloads. Reduced precision yields large gains, with INT8 reaching up to 58x throughput over CPU baselines, while the T4 remains competitive for large-batch cases where cost or power efficiency matters.

What carries the argument

DEEP-GAP, a controlled side-by-side benchmarking method that extends prior GPU evaluation techniques to measure inference throughput and efficiency across GPU generations and precision modes.
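The shape of such a controlled throughput measurement can be sketched in a few lines. Everything below (`run_inference`, the warmup and iteration counts) is hypothetical scaffolding, not DEEP-GAP's actual harness; in the paper's setting the measured callable would be a PyTorch or TensorRT forward pass on the T4 or L4.

```python
import time

def measure_throughput(run_inference, batch_size, warmup=10, iters=100):
    """Time a fixed number of identical inference calls and report
    images/second. `run_inference` stands in for one forward pass
    over a batch of `batch_size` inputs."""
    for _ in range(warmup):      # discard warmup iterations so clocks,
        run_inference()          # caches, and JIT paths settle first
    start = time.perf_counter()
    for _ in range(iters):
        run_inference()
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed   # images per second

# Example with a dummy workload standing in for a model forward pass.
if __name__ == "__main__":
    dummy = lambda: sum(i * i for i in range(1000))
    print(f"{measure_throughput(dummy, batch_size=32):.0f} img/s")
```

Repeating this loop across models, batch sizes, precisions, and GPUs under otherwise identical conditions is, in outline, what a side-by-side comparison like the paper's requires.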

If this is right

  • INT8 precision can deliver up to 58x higher throughput than CPU baselines across the tested models.
  • L4 GPUs improve options for latency-sensitive inference by hitting peak efficiency at smaller batches than prior generations.
  • T4 GPUs remain suitable for high-volume large-batch workloads when power or cost per unit throughput is the priority.
  • Deployments can select precision mode, batch size, and GPU generation together to meet specific latency or efficiency targets.
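If those selection rules hold, picking a deployment configuration reduces to a lookup over measured operating points. A toy sketch follows; the numbers are invented and merely echo the paper's qualitative findings (L4 INT8 strongest at small batches, T4 viable at large ones), not its measurements.

```python
# Hypothetical measured operating points:
# (gpu, precision, batch) -> (median latency in ms, throughput in img/s).
POINTS = {
    ("L4", "INT8", 16): (4.0, 4000),
    ("L4", "INT8", 32): (7.0, 4600),
    ("L4", "FP16", 32): (11.0, 2900),
    ("T4", "INT8", 64): (38.0, 1700),
    ("T4", "FP16", 128): (95.0, 1350),
}

def best_config(latency_budget_ms):
    """Among configurations meeting the latency budget, pick the one
    with the highest throughput; None if no point qualifies."""
    feasible = {k: v for k, v in POINTS.items() if v[0] <= latency_budget_ms}
    if not feasible:
        return None
    return max(feasible, key=lambda k: feasible[k][1])

print(best_config(5.0))    # ('L4', 'INT8', 16) -- tight budget, small batch
print(best_config(100.0))  # ('L4', 'INT8', 32) -- loose budget, peak throughput
```

Adding a power or cost column to the table would let the same lookup express the large-batch cases where the paper finds the T4 still competitive.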

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Operators could mix L4 and T4 cards within the same rack according to whether workloads are latency-focused or throughput-focused at scale.
  • The same controlled comparison method could be applied to newer GPU generations or to training workloads to track how architectural changes accumulate.
  • If the small-batch advantage holds more broadly, inference frameworks might default to lower batch sizes on newer hardware to reduce tail latency.

Load-bearing premise

The three ResNet models together with the selected batch sizes, precisions, and software frameworks stand in for typical production inference workloads.

What would settle it

Repeating the measurements on transformer models or on traces from actual production inference services would show whether the 4.4x throughput edge and the 16-32 batch-size efficiency peak still appear.

Figures

Figures reproduced from arXiv: 2604.14552 by Kathiravan Palaniappan.

Figure 1: Evolution of NVIDIA datacenter GPUs across archi…
Figure 2: Benchmark execution structure. For each model…
Figure 4: Lower precision formats require fewer bits to rep…
Figure 3: Effect of numerical precision on data representation and parallel execution. Lower precision reduces data size, enabling…
Figure 4: Compute efficiency across precision modes. Lower precision enables more computations per cycle due to reduced…
Figure 5: Thermal behavior, power consumption, and utilization across NVIDIA T4 (top row) and NVIDIA L4 (bottom row) during…
Figure 6: Throughput scaling across batch sizes for ResNet18, ResNet50, and ResNet101. Top row shows NVIDIA T4 results,…
Figure 7: Median latency comparison across ResNet models and precision modes. INT8 consistently achieves the lowest latency,…
Figure 8: Median latency scaling across batch sizes for ResNet18, ResNet50, and ResNet101. Top row shows NVIDIA T4 results…
Figure 9: GPU memory usage measured via NVML across batch sizes. Top row shows NVIDIA T4 results, bottom row shows…
Figure 10: Latency vs throughput Pareto tradeoff across batch sizes. Top row shows NVIDIA T4 results and bottom row shows…
Figure 11: Throughput speedup relative to FP32 baseline across batch sizes. Top row shows NVIDIA T4 results, bottom row…
Original abstract

Modern datacenters increasingly rely on low-power, single-slot inference accelerators to balance performance, energy efficiency, and rack density constraints. The NVIDIA T4 GPU has become widely deployed due to strong performance per watt and mature software support. Its successor, the NVIDIA L4 GPU, introduces improvements in Tensor Core throughput, cache capacity, memory bandwidth, and parallel execution capability. However, limited empirical evidence quantifies the practical inference performance gap between these two generations under controlled and reproducible conditions. This work introduces DEEP-GAP, a systematic evaluation extending the GDEV-AI methodology to GPU inference. Using identical configurations and workloads, we evaluate ResNet18, ResNet50, and ResNet101 across FP32, FP16, and INT8 precision modes using PyTorch and TensorRT. Results show that reduced precision significantly improves performance, with INT8 achieving up to 58x throughput improvement over CPU baselines. L4 achieves up to 4.4x higher throughput than T4 while reaching peak efficiency at smaller batch sizes between 16 and 32, improving latency-throughput tradeoffs for latency-sensitive workloads. T4 remains competitive for large batch workloads where cost or power efficiency is important. DEEP-GAP provides practical guidance for selecting precision modes, batch sizes, and GPU architectures for modern inference deployments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DEEP-GAP, an extension of the GDEV-AI methodology, to empirically compare NVIDIA T4 and L4 GPUs for deep learning inference. Using identical configurations, it evaluates ResNet18, ResNet50, and ResNet101 models under FP32, FP16, and INT8 precisions with both PyTorch and TensorRT frameworks. Key claims include up to 4.4x higher throughput on L4 versus T4, peak efficiency on L4 at batch sizes 16-32 (versus larger batches for T4), and up to 58x throughput gains for INT8 over CPU baselines, with guidance on precision, batch size, and architecture selection for inference workloads.

Significance. If the controlled measurements hold, the work provides practical, reproducible empirical data on generational improvements in low-power inference GPUs, particularly L4's advantages in Tensor Core throughput, cache, bandwidth, and parallelism for latency-sensitive scenarios. The systematic cross-precision and cross-framework design, combined with direct T4/L4 comparisons under matched conditions, offers actionable insights for datacenter deployment decisions where power, density, and latency tradeoffs matter.

major comments (2)
  1. [Abstract and Results] The central quantitative claims (4.4x L4 vs. T4 throughput and 58x INT8 vs. CPU) are reported as point values without error bars, standard deviations, number of runs, or exclusion criteria. This directly affects verifiability of the headline measurements that support the latency-throughput tradeoff interpretation.
  2. [Evaluation and Discussion] The claim that L4 improves latency-throughput tradeoffs for latency-sensitive workloads rests on batch-size scaling observed only for ResNet18/50/101; no ablation or sensitivity analysis on other workload classes (e.g., transformers or recommendation models) is provided, so the architectural interpretation cannot be separated from model-specific traits.
minor comments (2)
  1. A summary table comparing key T4 and L4 hardware parameters (Tensor Core throughput, memory bandwidth, cache sizes) would help readers contextualize the observed deltas without external lookup.
  2. [Abstract] The abstract should explicitly state the precise model, precision, and batch size at which the 4.4x throughput figure is attained, rather than leaving it as an unqualified maximum.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive recommendation and the detailed feedback on verifiability and scope. We address each major comment below and have revised the manuscript accordingly to improve clarity and precision.

Point-by-point responses
  1. Referee: [Abstract and Results] The central quantitative claims (4.4x L4 vs. T4 throughput and 58x INT8 vs. CPU) are reported as point values without error bars, standard deviations, number of runs, or exclusion criteria. This directly affects verifiability of the headline measurements that support the latency-throughput tradeoff interpretation.

    Authors: We agree that statistical details enhance verifiability. In the revised manuscript we now state that each configuration was executed 10 times under identical conditions, report mean throughput values, and include standard deviations in the results tables and key figures. The abstract and results sections have been updated to reference these details. Variations were small in our controlled environment, but the added information addresses the concern directly. revision: yes

  2. Referee: [Evaluation and Discussion] The claim that L4 improves latency-throughput tradeoffs for latency-sensitive workloads rests on batch-size scaling observed only for ResNet18/50/101; no ablation or sensitivity analysis on other workload classes (e.g., transformers or recommendation models) is provided, so the architectural interpretation cannot be separated from model-specific traits.

    Authors: We acknowledge the limitation. Our evaluation deliberately focuses on ResNet models as representative CNN inference workloads. In the revised version we have qualified all relevant claims to specify 'for ResNet-based models' and added a dedicated paragraph in the discussion section noting that broader applicability to transformers or recommendation models remains future work. This prevents overgeneralization while retaining the value of the controlled T4/L4 comparison for the evaluated class of workloads. revision: partial
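The revision promised in the first response (10 runs per configuration, mean and standard deviation) is simple to reproduce in spirit. The sketch below uses invented sample values to show the aggregation the authors describe, not their data.

```python
from statistics import mean, stdev

# Ten hypothetical throughput measurements (img/s) for one configuration;
# the described revision reports mean +/- stdev figures like this.
runs = [4012, 4030, 3998, 4051, 4007, 4022, 3989, 4040, 4015, 4026]

mu, sigma = mean(runs), stdev(runs)
print(f"throughput = {mu:.0f} +/- {sigma:.0f} img/s "
      f"({sigma / mu * 100:.2f}% relative spread)")
# -> throughput = 4019 +/- 19 img/s (0.47% relative spread)
```

Reporting the relative spread alongside the mean is what lets a reader judge whether a headline ratio such as 4.4x is resolved well above measurement noise.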

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivations or fitted predictions

Full rationale

The paper reports direct experimental throughput and latency results for ResNet models on T4 and L4 GPUs across precisions and frameworks. No equations, parameter fitting, self-citations as uniqueness theorems, or renamings of known results appear in the provided text. Claims such as 'L4 achieves up to 4.4x higher throughput' and 'peak efficiency at smaller batch sizes' are presented as measured outcomes, not derived quantities. The evaluation methodology (DEEP-GAP extending GDEV-AI) is described as a systematic comparison under identical conditions, with no load-bearing steps that reduce to inputs by construction. Generalization concerns exist but are outside the circularity definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmarking study containing no free parameters, axioms, or invented entities. All reported quantities are direct experimental outcomes.

pith-pipeline@v0.9.0 · 5537 in / 977 out tokens · 41056 ms · 2026-05-10T08:56:27.739234+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

29 extracted references · 2 canonical work pages

  1. A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” NeurIPS, 2012.
  2. J. Deng et al., “ImageNet: A Large-Scale Hierarchical Image Database,” CVPR, 2009.
  3. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CVPR, 2016.
  4. A. Canziani, A. Paszke, and E. Culurciello, “An Analysis of Deep Neural Network Models for Practical Applications,” arXiv:1605.07678, 2016.
  5. PyTorch Contributors, TorchVision ResNet Documentation. Available: https://docs.pytorch.org/vision/main/models/resnet.html
  6. L. Barroso, J. Clidaras, and U. Holzle, “The Datacenter as a Computer,” Morgan & Claypool, 2013.
  7. N. Jouppi et al., “In-Datacenter Performance Analysis of a Tensor Processing Unit,” ISCA, 2017.
  8. P. Micikevicius et al., “Mixed precision training,” ICLR, 2018.
  9. B. Jacob et al., “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” CVPR, 2018.
  10. R. Krishnamoorthi, “Quantizing deep convolutional networks for efficient inference: A whitepaper,” arXiv:1806.08342, 2018.
  11. NVIDIA Corporation, NVIDIA TensorRT Documentation. Available: https://docs.nvidia.com/deeplearning/tensorrt/latest/index.html
  12. NVIDIA Corporation, NVIDIA TensorRT Developer Guide. Available: https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-861/developer-guide/index.html
  13. NVIDIA Corporation, NVIDIA Turing GPU Architecture Whitepaper, 2018.
  14. NVIDIA Corporation, NVIDIA Ampere Architecture Whitepaper, 2020.
  15. NVIDIA Corporation, NVIDIA Ada GPU Architecture Whitepaper, 2022.
  16. NVIDIA Corporation, NVIDIA T4 Tensor Core GPU Datasheet, 2018.
  17. NVIDIA Corporation, NVIDIA L4 Tensor Core GPU Datasheet, 2023.
  18. V. J. Reddi et al., “MLPerf Inference Benchmark,” ISCA, 2020.
  19. J. Dean and L. Barroso, “The Tail at Scale,” Communications of the ACM, 2013.
  20. S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful visual performance model for multicore architectures,” Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009.
  21. V. Volkov, “Better performance at lower occupancy,” Proceedings of GPU Technology Conference (GTC), 2010.
  22. M. Jeon et al., “Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads,” USENIX ATC, 2019.
  23. J. Pineau et al., “Improving Reproducibility in Machine Learning Research,” JMLR, 2021.
  24. S. Han, H. Mao, and W. Dally, “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding,” ICLR, 2016.
  25. Y. Kang et al., “GPUSync: A Framework for Real-Time GPU Management,” IEEE RTSS, 2017.
  26. J. Dean, “Achieving Rapid Response Times in Large Online Services,” ACM Queue, 2013.
  27. A. Vaswani et al., “Attention Is All You Need,” NeurIPS, 2017.
  28. K. Palaniappan, “GDEV-AI: A Generalized Evaluation of Deep Learning Inference Scaling and Architectural Saturation,” arXiv preprint, 2026.
  29. K. Palaniappan, “DEEP-GAP GPU Inference Benchmark,” GitHub repository, 2026. Available: https://github.com/kpalania1/deep-gap-gpu-inference-benchmark