arXiv: 2509.03472 · v2 · submitted 2025-09-03 · cs.LG · cs.AI · cs.DC

DPQuant: Efficient and Differentially-Private Model Training via Dynamic Quantization Scheduling

Pith reviewed 2026-05-18 19:04 UTC · model grok-4.3

classification: cs.LG, cs.AI, cs.DC
keywords: differential privacy, quantization, DP-SGD, dynamic scheduling, model efficiency, privacy-preserving ML, low-precision training

The pith

Dynamic quantization scheduling reduces accuracy loss from noise amplification in differentially private training while delivering up to 2.21 times higher throughput.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Quantization reduces training compute but causes larger accuracy drops under differential privacy because the injected noise amplifies quantization variance. DPQuant counters this in two ways: it rotates which layers are quantized each epoch through probabilistic sampling, and it ranks layers with a loss sensitivity estimator that itself satisfies differential privacy. The estimator uses only a negligible fraction of the total privacy budget, so the overall guarantee is preserved. Experiments on ResNet18, ResNet50, and DenseNet121 show that the approach reaches near Pareto-optimal accuracy-compute trade-offs with less than 2 percent validation accuracy loss, and that it extends to DP-Adam.

Core claim

Quantization variance grows disproportionately under the noise injection of DP-SGD and DP-Adam; this degradation is reduced by a dynamic schedule that probabilistically rotates the set of quantized layers every epoch and prioritizes quantization decisions via a differentially private loss sensitivity estimator that consumes negligible privacy budget.

What carries the argument

DPQuant dynamic quantization scheduler that combines probabilistic layer rotation across epochs with a differentially private loss sensitivity estimator for layer prioritization.
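
A minimal sketch of how such a scheduler could be wired together. The function name `sample_quantized_layers`, the `sensitivity` scores, and the inverse-sensitivity weighting are all assumptions made for illustration; the page does not state the paper's actual sampling rule. Each epoch, `k` distinct layers are drawn, and layers with lower estimated loss impact are more likely to be quantized.

```python
import random

def sample_quantized_layers(sensitivity, k, rng):
    """Draw k distinct layers to quantize for one epoch.

    sensitivity: dict mapping layer name -> nonnegative estimated loss impact.
    These scores are hypothetical; they are assumed to come from a
    differentially private estimator. Lower sensitivity means a higher
    sampling weight (an illustrative choice, not the paper's rule).
    """
    eps = 1e-8  # avoids division by zero for a zero score
    remaining = dict(sensitivity)
    chosen = []
    for _ in range(min(k, len(remaining))):
        names = list(remaining)
        weights = [1.0 / (remaining[n] + eps) for n in names]
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # sample without replacement
    return chosen

rng = random.Random(0)
scores = {"conv1": 0.9, "layer1": 0.2, "layer2": 0.1, "fc": 0.5}
# One draw per epoch, so the quantized set rotates over time.
schedule = [sample_quantized_layers(scores, k=2, rng=rng) for _ in range(5)]
```

Because the draw is repeated every epoch, no layer is permanently quantized, which is the rotation property the summary above describes.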

If this is right

  • DPQuant outperforms static quantization baselines on accuracy-compute trade-offs for ResNet18, ResNet50, and DenseNet121.
  • Theoretical throughput on low-precision hardware improves by up to 2.21 times.
  • Validation accuracy remains within 2 percent of full-precision DP training.
  • The same scheduling gains appear when the method is applied to DP-Adam.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same rotation-plus-private-ranking pattern could be applied to other noise-injected optimizers beyond DP-SGD and DP-Adam.
  • Hardware measurements on actual low-precision accelerators would be needed to confirm the claimed throughput numbers translate to wall-clock savings.
  • Combining the scheduler with complementary techniques such as gradient compression could produce further efficiency gains under fixed privacy budgets.

Load-bearing premise

The differentially private loss sensitivity estimator can reliably identify which layers can be quantized with little quality impact while using only a negligible fraction of the overall privacy budget.
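
For orientation, one generic way to privatize a per-layer loss measurement is the clip-then-add-Gaussian-noise pattern familiar from DP-SGD. The sketch below is an assumption about what such an estimator might look like, not the paper's formula; `private_loss_delta`, `clip`, and `noise_multiplier` are hypothetical names.

```python
import random

def private_loss_delta(per_example_deltas, clip, noise_multiplier, rng):
    """Noisy mean of clipped per-example loss changes for one layer.

    per_example_deltas: loss(quantized layer) - loss(full precision), one
    value per example (hypothetical inputs).
    Each example contributes at most `clip` in absolute value, so adding or
    removing one example changes the sum by at most `clip`. Gaussian noise
    with standard deviation noise_multiplier * clip is added to the sum.
    The batch size is treated as public here, as is common in DP-SGD code.
    """
    clipped = [max(-clip, min(clip, d)) for d in per_example_deltas]
    noisy_sum = sum(clipped) + rng.gauss(0.0, noise_multiplier * clip)
    return noisy_sum / len(per_example_deltas)

rng = random.Random(0)
estimate = private_loss_delta([0.5, 2.0, -3.0], clip=1.0,
                              noise_multiplier=1.0, rng=rng)
```

The privacy cost of each such query would still have to be tracked by an accountant and added to the training budget; how small that cost is in practice is exactly the premise under question.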

What would settle it

An ablation that disables the loss sensitivity estimator or increases its privacy allocation, after which accuracy-compute curves fall back to the levels of static quantization baselines.

Figures

Figures reproduced from arXiv: 2509.03472 by Gennady Pekhimenko, Nandita Vijaykumar, Renbo Tu, Yubo Gao.

Figure 1: Comparing quantized SGD vs DP-SGD ResNet18 training on the GTSRB dataset
Figure 2: DPQUANT system overview

Suppose a layer is quantized with probability p. Let g_fp denote its full-precision gradients and g_quant its gradients computed under quantization. By Section 4, quantization incurs additional variance, hence Var(g_fp) ≤ Var(g_quant). The expected gradient variance can be written as:

E(Var(g)) = (1 − p) Var(g_fp) + p Var(g_quant) ≤ Var(g_quant)

From this it follows that whenever p …
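
To make the inequality concrete, the short sketch below evaluates the expected variance for a few values of p. The two variance values are made up for illustration and only need to satisfy Var(g_fp) ≤ Var(g_quant); `expected_variance` is a hypothetical helper name.

```python
def expected_variance(p, var_fp, var_quant):
    """E[Var(g)] = (1 - p) * Var(g_fp) + p * Var(g_quant)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    return (1.0 - p) * var_fp + p * var_quant

# Illustrative values only, chosen so that var_fp <= var_quant.
var_fp, var_quant = 1.0, 1.5
values = [expected_variance(p, var_fp, var_quant) for p in (0.0, 0.25, 0.5, 1.0)]
# values == [1.0, 1.125, 1.25, 1.5]
```

The expected variance interpolates linearly between the two extremes, so any p below 1 gives a value no larger than the always-quantized case.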
Figure 3: Privacy cost of analysis for ResNet18/GTSRB; performing analysis every 2 epochs
Figure 4: Comparing policies generated by DPQUANT to the speed-accuracy Pareto front

… a certain number of layers are quantized. We refer to the desired number of quantized layers as “computational budget” because it determines the speed and compute resources needed. In …
Figure 5: Ablation study. PLS: probabilistic layer selection; LLP: loss-aware layer prioritization

In order to better understand the contributions of the two approaches, we compared our approach (probabilistic layer sampling + loss-aware layer prioritization) with probabilistic layer sampling (PLS) alone. In …
Figure 6: Theoretical speedups for DPQUANT assuming 90% of the layers are quantized.

As hardware with support for FP4 MatMuls and Conv2D (e.g., NVIDIA Blackwell) is not yet widely available, we are unable to evaluate the speed benefits of quantization with DPQUANT. Instead, we use estimates from prior work, along with performance statistics published by NVIDIA [37], to estimate speedups. We estimate that FP4 can …
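
One common way to form a theoretical end-to-end estimate of this kind is Amdahl's law. Whether the paper uses exactly this formula is not stated in the excerpt above, so the sketch below is an assumption; `accelerated_fraction` (share of runtime spent in operations that benefit from low precision) and `kernel_speedup` are hypothetical parameters with illustrative values.

```python
def amdahl_speedup(accelerated_fraction, kernel_speedup):
    """End-to-end speedup when only part of the runtime is accelerated.

    accelerated_fraction: share of baseline runtime that gets faster (0..1).
    kernel_speedup: speedup factor of that share alone (> 0).
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must lie in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - accelerated_fraction)
                  + accelerated_fraction / kernel_speedup)

# Illustrative numbers only: half the runtime runs 2x faster.
estimate = amdahl_speedup(accelerated_fraction=0.5, kernel_speedup=2.0)
```

Note that the share of layers quantized and the share of runtime accelerated are different quantities; a real estimate would need a runtime breakdown per operation.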
Figure 7: Quantization simulation setup

[Footnote 2] This assumption is stated in https://github.com/pytorch/opacus/blob/main/opacus/accountants/analysis/rdp.py
Figure 8: Runtime decomposition of DP-SGD training
Original abstract

Differentially-Private SGD (DP-SGD) and its adaptive variant DP-Adam are powerful techniques to protect user privacy when using sensitive data to train neural networks. During training, converting model weights and activations into low-precision formats, i.e., quantization, can drastically reduce training times, energy consumption, and cost, and is thus a widely used technique. In this work, we demonstrate for the first time that quantization causes significantly higher accuracy degradation in DP training compared to regular SGD. We observe that this is caused by noise injection, which amplifies quantization variance, leading to disproportionately large accuracy degradation. To address this challenge, we present DPQuant, a dynamic quantization framework that adaptively selects a changing subset of layers to quantize at each epoch. Our method combines two key ideas that effectively reduce quantization variance: (i) probabilistic sampling that rotates which layers are quantized every epoch, and (ii) loss-aware layer prioritization, which uses a differentially private loss sensitivity estimator to identify layers that can be quantized with minimal impact on model quality. This estimator consumes a negligible fraction of the overall privacy budget, preserving DP guarantees. Empirical evaluations on ResNet18, ResNet50, and DenseNet121 across a range of datasets demonstrate that DPQuant consistently outperforms static quantization baselines, achieving near Pareto-optimal accuracy-compute trade-offs and up to $2.21\times$ theoretical throughput improvements on low-precision hardware, with less than 2% drop in validation accuracy. We further show that our framework extends to DP-Adam with similar gains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that quantization induces significantly higher accuracy degradation under DP-SGD than under standard SGD because injected noise amplifies quantization variance. It proposes DPQuant, which dynamically selects a rotating subset of layers for quantization each epoch via (i) probabilistic sampling and (ii) a loss-aware prioritization that employs a differentially private loss sensitivity estimator. The estimator is asserted to consume only a negligible fraction of the total privacy budget. Experiments on ResNet18, ResNet50 and DenseNet121 across multiple datasets report that DPQuant outperforms static quantization baselines, reaches near-Pareto-optimal accuracy-compute trade-offs, delivers up to 2.21× theoretical throughput gains on low-precision hardware, and incurs less than 2 % validation-accuracy drop; similar gains are shown for DP-Adam.

Significance. If the empirical claims and the negligible-budget property of the estimator hold, the work would be a useful practical contribution: it directly tackles the under-studied interaction between DP noise and quantization error and supplies a concrete scheduling mechanism that improves efficiency without materially harming privacy or accuracy. The reported throughput numbers and extension to DP-Adam strengthen the case for deployment on quantized hardware.

major comments (2)
  1. [Abstract / §4 (method)] Abstract and the description of the loss sensitivity estimator: the central claim that the estimator “consumes a negligible fraction of the overall privacy budget” and still produces reliable layer rankings is load-bearing for both the DP guarantee and the reported accuracy gains. No concrete budget split (e.g., ε_estimator / ε_total), no formula for sensitivity computation, and no stability analysis under the added DP noise are supplied; if the estimator’s own noise corrupts the ranking, the dynamic schedule may not outperform static baselines.
  2. [Experimental evaluation] Empirical section: the headline results (<2 % accuracy drop, 2.21× throughput, Pareto optimality) are presented without error bars, exact (ε,δ) values, or ablations that isolate the contribution of the DP estimator versus the probabilistic rotation alone. This leaves open whether the observed gains are statistically robust or dataset/architecture-specific.
minor comments (2)
  1. [Method] Clarify the precise definition and implementation of the probabilistic rotation schedule (e.g., sampling probability per layer, epoch-wise reselection rule) so that the method is reproducible from the text alone.
  2. [Figures and tables] Add error bars or confidence intervals to all accuracy and throughput plots; without them the “near Pareto-optimal” claim is difficult to evaluate.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the practical significance of addressing the interaction between DP noise and quantization. We address each major comment below and commit to revisions that strengthen the clarity and robustness of the claims without altering the core contributions.

Point-by-point responses
  1. Referee: [Abstract / §4 (method)] Abstract and the description of the loss sensitivity estimator: the central claim that the estimator “consumes a negligible fraction of the overall privacy budget” and still produces reliable layer rankings is load-bearing for both the DP guarantee and the reported accuracy gains. No concrete budget split (e.g., ε_estimator / ε_total), no formula for sensitivity computation, and no stability analysis under the added DP noise are supplied; if the estimator’s own noise corrupts the ranking, the dynamic schedule may not outperform static baselines.

    Authors: We agree that the current presentation of the loss sensitivity estimator in §4 is high-level and that concrete details are needed to fully substantiate the negligible-budget claim and the reliability of the resulting layer rankings. The manuscript describes the estimator as a DP mechanism with bounded sensitivity, but does not provide an explicit numerical budget allocation, the precise sensitivity formula, or a stability analysis. In the revised manuscript we will add a dedicated paragraph in §4 that (i) states the exact budget split used (e.g., ε_estimator ≤ 0.05 ε_total), (ii) gives the closed-form sensitivity expression, and (iii) reports an empirical stability study comparing private versus non-private layer rankings across the evaluated architectures. These additions will directly address the concern that estimator noise could degrade the dynamic schedule. revision: yes

  2. Referee: [Experimental evaluation] Empirical section: the headline results (<2 % accuracy drop, 2.21× throughput, Pareto optimality) are presented without error bars, exact (ε,δ) values, or ablations that isolate the contribution of the DP estimator versus the probabilistic rotation alone. This leaves open whether the observed gains are statistically robust or dataset/architecture-specific.

    Authors: We concur that the experimental section would benefit from greater statistical transparency and component-wise ablations. The current results report mean accuracy and throughput but omit standard deviations, do not list the precise (ε, δ) tuples for every configuration, and do not isolate the loss-aware prioritization from the probabilistic rotation. In the revision we will (i) add error bars (mean ± std) to all accuracy and throughput figures, (ii) explicitly tabulate the (ε, δ) values used for each dataset–architecture pair, and (iii) include a new ablation table that compares full DPQuant against a variant that uses only probabilistic rotation (i.e., without the DP loss-sensitivity estimator). These changes will demonstrate that both mechanisms contribute to the reported gains and that the improvements are statistically consistent across the evaluated settings. revision: yes
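
The stability study promised in response 1 could, for example, compare non-private and noisy layer rankings with a rank correlation. The sketch below is a hedged illustration under that assumption: the scores, the noise levels, and the helper names `ranks` and `spearman` are hypothetical, and the formula used is the standard Spearman coefficient for data without ties.

```python
import random

def ranks(values):
    """Rank positions (0 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result

def spearman(a, b):
    """Spearman rank correlation for tie-free data."""
    n = len(a)
    ra, rb = ranks(a), ranks(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

rng = random.Random(0)
true_scores = [0.1 * i for i in range(1, 11)]  # hypothetical layer scores
for sigma in (0.0, 0.05, 0.5):                 # illustrative noise levels
    noisy = [s + rng.gauss(0.0, sigma) for s in true_scores]
    print(sigma, round(spearman(true_scores, noisy), 3))
```

A correlation near 1 at the noise level actually used by the estimator would support the claim that the private ranking tracks the non-private one; a low value would support the referee's concern.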

Circularity Check

0 steps flagged

No circularity: DPQuant is an empirical algorithmic proposal with independent evaluations

full rationale

The paper proposes DPQuant as a practical dynamic quantization scheduler for DP-SGD that combines probabilistic layer rotation with a loss-aware prioritization step driven by a DP sensitivity estimator. All performance claims (accuracy, throughput, Pareto trade-offs) are backed by direct empirical measurements on ResNet18/50 and DenseNet121 across datasets, rather than any closed-form derivation or prediction that reduces to the method's own fitted quantities. The statement that the estimator consumes a negligible privacy budget fraction is presented as an implementation choice without any equation that re-derives or re-uses the same sensitivity scores to justify itself. No self-citation chain is load-bearing for the core contribution, and the work remains falsifiable by external replication on the same models and privacy parameters.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that quantization variance is disproportionately amplified by DP noise and on the practical claim that a small privacy budget suffices for the sensitivity estimator; no new mathematical entities or free parameters are introduced beyond standard DP-SGD hyperparameters.

axioms (1)
  • Domain assumption: Quantization variance is amplified by the noise injection of DP-SGD, causing larger accuracy degradation than in non-private training.
    Stated as the root cause that motivates the dynamic scheduler.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 12 internal anchors

  1. [1]

    Deep learning with differential privacy

    Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS’16. ACM, October 2016. doi: 10.1145/2976749.2978318. URL http://dx.doi.org/10.1145/2976749.2978318

  2. [2]

    Pareto-optimal quantized resnet is mostly 4-bit

    AmirAli Abdolrashidi, Lisa Wang, Shivani Agrawal, Jonathan Malmaud, Oleg Rybakov, Chas Leichner, and Lukasz Lew. Pareto-optimal quantized resnet is mostly 4-bit. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), page 3085–3093. IEEE, June 2021. doi: 10.1109/cvprw53098.2021.00345. URL http://dx.doi.org/10. 1109/CVPRW...

  3. [3]

    AMD Instinct™ MI300X Accelerator Data Sheet: Leading-Edge Accelerator Module for Generative AI, Training, and High-Performance Computing

    Advanced Micro Devices, Inc. AMD Instinct™ MI300X Accelerator Data Sheet: Leading-Edge Accelerator Module for Generative AI, Training, and High-Performance Computing. Technical report, Advanced Micro Devices, Inc., 2023. URL https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/data-sheets/amd-instinct-mi300x-data-sheet.pdf. Accessed: 2025-05-13

  4. [4]

    Data types and precision support

    Advanced Micro Devices, Inc. Data types and precision support. https://rocm.docs.amd.com/en/latest/reference/precision-support.html, March 2025. ROCm Documentation; Accessed: 2025-05-13

  5. [5]

    L-greco: Layerwise-adaptive gradient compression for efficient and accurate deep learning, 2023

    Mohammadreza Alimohammadi, Ilia Markov, Elias Frantar, and Dan Alistarh. L-greco: Layerwise-adaptive gradient compression for efficient and accurate deep learning, 2023. URL https://arxiv.org/abs/2210.17357

  6. [6]

    QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding

    Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. Qsgd: Communication-efficient sgd via gradient quantization and encoding, 2017. URL https://arxiv.org/abs/1610.02132

  7. [7]

    Post-training 4-bit quantization of convolution networks for rapid-deployment, 2019

    Ron Banner, Yury Nahshan, Elad Hoffer, and Daniel Soudry. Post-training 4-bit quantization of convolution networks for rapid-deployment, 2019. URL https://arxiv.org/abs/1810.05723

  8. [8]

    Efficientqat: Efficient quantization-aware training for large language models

    Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, and Ping Luo. Efficientqat: Efficient quantization-aware training for large language models. arXiv preprint arXiv:2407.11062, 2024

  9. [9]

    Accurate neural training with 4-bit matrix multiplications at standard formats, 2024

    Brian Chmiel, Ron Banner, Elad Hoffer, Hilla Ben Yaacov, and Daniel Soudry. Accurate neural training with 4-bit matrix multiplications at standard formats, 2024. URL https://arxiv.org/abs/2112.10769

  10. [10]

    PACT: Parameterized Clipping Activation for Quantized Neural Networks

    Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. Pact: Parameterized clipping activation for quantized neural networks, 2018. URL https://arxiv.org/abs/1805.06085

  11. [11]

    EMNIST: an extension of MNIST to handwritten letters

    Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik. Emnist: an extension of mnist to handwritten letters, 2017. URL https://arxiv.org/abs/1702.05373

  12. [12]

    Unlocking high-accuracy differentially private image classification through scale

    Soham De, Leonard Berrada, Jamie Hayes, Samuel L. Smith, and Borja Balle. Unlocking high-accuracy differentially private image classification through scale, 2022. URL https://arxiv.org/abs/2204.13650

  13. [13]

    Cbq: Cross-block quantization for large language models

    Xin Ding, Xiaoyu Liu, Zhijun Tu, Yun Zhang, Wei Li, Jie Hu, Hanting Chen, Yehui Tang, Zhiwei Xiong, Baoqun Yin, et al. Cbq: Cross-block quantization for large language models. arXiv preprint arXiv:2312.07950, 2023

  14. [14]

    Hawq: Hessian aware quantization of neural networks with mixed-precision, 2019

    Zhen Dong, Zhewei Yao, Amir Gholami, Michael Mahoney, and Kurt Keutzer. Hawq: Hessian aware quantization of neural networks with mixed-precision, 2019. URL https://arxiv.org/abs/1905.03696

  15. [15]

    Dynamic differential-privacy preserving sgd, 2022

    Jian Du, Song Li, Xiangyi Chen, Siheng Chen, and Mingyi Hong. Dynamic differential-privacy preserving sgd, 2022. URL https://arxiv.org/abs/2111.00173

  16. [16]

    The algorithmic foundations of differential privacy

    Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014. doi: 10.1561/0400000042. URL https://www.nowpublishers.com/article/Details/TCS-042

  17. [17]

    Learned step size quantization

    Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S. Modha. Learned step size quantization, 2020. URL https://arxiv.org/abs/1902.08153

  18. [18]

    A survey of quantization methods for efficient neural network inference

    Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient neural network inference, 2021. URL https://arxiv.org/abs/2103.13630

  19. [19]

    Deep Residual Learning for Image Recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015. URL https://arxiv.org/abs/1512.03385

  20. [20]

    Densely Connected Convolutional Networks

    Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks, 2018. URL https://arxiv.org/abs/1608.06993

  21. [21]

    Accurate post training quantization with small calibration sets

    Itay Hubara, Yury Nahshan, Yair Hanani, Ron Banner, and Daniel Soudry. Accurate post training quantization with small calibration sets. In International Conference on Machine Learning, pages 4466–4475. PMLR, 2021

  22. [22]

    Low-rank compression of neural nets: Learning the rank of each layer

    Yerlan Idelbayev and Miguel A Carreira-Perpinán. Low-rank compression of neural nets: Learning the rank of each layer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8049–8059, 2020

  23. [23]

    Quantization and training of neural networks for efficient integer-arithmetic-only inference, 2017

    Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference, 2017. URL https://arxiv.org/abs/1712.05877

  24. [24]

    Auditing differentially private machine learning: How private is private sgd?

    Matthew Jagielski, Jonathan Ullman, and Alina Oprea. Auditing differentially private machine learning: How private is private sgd?, 2020. URL https://arxiv.org/abs/2006.07709

  25. [25]

    Accelerating stochastic gradient descent using predictive variance reduction

    Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper_files/paper/2013/file/ ac1dd209cb...

  26. [26]

    Error feedback fixes signsgd and other gradient compression schemes

    Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. Error feedback fixes signsgd and other gradient compression schemes. In International Conference on Machine Learning, pages 3252–3261. PMLR, 2019

  27. [27]

    Quantizing deep convolutional networks for efficient inference: A whitepaper

    Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018

  28. [28]

    Learning Multiple Layers of Features from Tiny Images

    Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, April 2009. URL https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf

  29. [29]

    Deep gradient compression: Reducing the communication bandwidth for distributed training

    Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J. Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training, 2020. URL https://arxiv.org/abs/1712.01887

  30. [30]

    Torchvision: Pytorch’s computer vision library

    TorchVision maintainers and contributors. Torchvision: Pytorch’s computer vision library. https://github.com/pytorch/vision, 2016

  31. [31]

    An optimization framework for differentially private sparse fine-tuning, 2025

    Mehdi Makni, Kayhan Behdin, Gabriel Afriat, Zheng Xu, Sergei Vassilvitskii, Natalia Ponomareva, Hussein Hazimeh, and Rahul Mazumder. An optimization framework for differentially private sparse fine-tuning, 2025. URL https://arxiv.org/abs/2503.12822

  32. [32]

    FP8 Formats for Deep Learning

    Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, Naveen Mellempudi, Stuart Oberman, Mohammad Shoeybi, Michael Siu, and Hao Wu. Fp8 formats for deep learning, 2022. URL https://arxiv.org/abs/2209.05433

  33. [33]

    Rényi differential privacy of the sampled gaussian mechanism

    Ilya Mironov, Kunal Talwar, and Li Zhang. Rényi differential privacy of the sampled gaussian mechanism, 2019. URL https://arxiv.org/abs/1908.10530

  34. [34]

    R+r: understanding hyperparameter effects in dp-sgd, 2024

    Felix Morsbach, Jan Reubold, and Thorsten Strufe. R+r: understanding hyperparameter effects in dp-sgd, 2024. URL https://arxiv.org/abs/2411.02051

  35. [35]

    Data-free quantization through weight equalization and bias correction

    Markus Nagel, Mart van Baalen, Tijmen Blankevoort, and Max Welling. Data-free quantization through weight equalization and bias correction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1325–1334, 2019

  36. [36]

    A White Paper on Neural Network Quantization

    Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart Van Baalen, and Tijmen Blankevoort. A white paper on neural network quantization. arXiv preprint arXiv:2106.08295, 2021

  37. [37]

    Nvidia blackwell architecture technical overview, 2024

    NVIDIA Corporation. Nvidia blackwell architecture technical overview, 2024. URL https://resources.nvidia.com/en-us-blackwell-architecture. Accessed: 2025-05-05

  38. [38]

    Value-aware quantization for training and inference of neural networks

    Eunhyeok Park, Sungjoo Yoo, and Peter Vajda. Value-aware quantization for training and inference of neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 580–595, 2018

  39. [39]

    How to dp-fy ml: A practical guide to machine learning with differential privacy

    Natalia Ponomareva, Hussein Hazimeh, Alex Kurakin, Zheng Xu, Carson Denison, H. Brendan McMahan, Sergei Vassilvitskii, Steve Chien, and Abhradeep Guha Thakurta. How to dp-fy ml: A practical guide to machine learning with differential privacy. Journal of Artificial Intelligence Research, 77:1113–1201, July 2023. ISSN 1076-9757. doi: 10.1613/jair.1.14649. U...

  40. [40]

    Ai hardware cores/accelerators, 2024

    Qualcomm Technologies, Inc. Ai hardware cores/accelerators, 2024. URL https://docs.qualcomm.com/bundle/publicresource/topics/80-63195-1/AI-hardware-cores-accelerators.html. Accessed: 2025-05-05

  41. [41]

    Optimal clipping and magnitude-aware differentiation for improved quantization-aware training

    Charbel Sakr, Steve Dai, Rangha Venkatesan, Brian Zimmer, William Dally, and Brucek Khailany. Optimal clipping and magnitude-aware differentiation for improved quantization-aware training. In International Conference on Machine Learning, pages 19123–19138. PMLR, 2022

  42. [42]

    Towards scalable distributed training of deep learning on public cloud clusters, 2020

    Shaohuai Shi, Xianhao Zhou, Shutao Song, Xingyao Wang, Zilin Zhu, Xue Huang, Xinan Jiang, Feihu Zhou, Zhenyu Guo, Liqiang Xie, Rui Lan, Xianbin Ouyang, Yan Zhang, Jieqian Wei, Jing Gong, Weiliang Lin, Ping Gao, Peng Meng, Xiaomin Xu, Chenyang Guo, Bo Yang, Zhibo Chen, Yongjian Wu, and Xiaowen Chu. Towards scalable distributed training of deep learning on ...

  43. [43]

    The german traffic sign recognition benchmark: A multi-class classification competition

    Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. The german traffic sign recognition benchmark: A multi-class classification competition. In The 2011 International Joint Conference on Neural Networks, pages 1453–1460, 2011. doi: 10.1109/IJCNN.2011.6033395

  44. [44]

    Sparsified sgd with memory

    Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified sgd with memory. URL https://arxiv.org/abs/1809.07599

  46. [46]

    Ultra-low precision 4-bit training of deep neural networks

    Xiao Sun, Naigang Wang, Chia-Yu Chen, Jiamin Ni, Ankur Agrawal, Xiaodong Cui, Swagath Venkataramani, Kaoutar El Maghraoui, Vijayalakshmi (Viji) Srinivasan, and Kailash Gopalakrishnan. Ultra-low precision 4-bit training of deep neural networks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Proces...

  47. [47]

    Powersgd: Practical low-rank gradient compression for distributed optimization

    Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. Powersgd: Practical low-rank gradient compression for distributed optimization. Advances in Neural Information Processing Systems, 32, 2019

  48. [48]

    HAQ: Hardware-Aware Automated Quantization with Mixed Precision

    Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. Haq: Hardware-aware automated quantization with mixed precision, 2019. URL https://arxiv.org/abs/1811.08886

  49. [49]

    TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning

    Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning, 2017. URL https://arxiv.org/abs/1705.07878

  50. [50]

    Bitwidth-adaptive quantization-aware neural network training: a meta-learning approach

    Jiseok Youn, Jaehun Song, Hyung-Sin Kim, and Saewoong Bahk. Bitwidth-adaptive quantization-aware neural network training: a meta-learning approach. In European Conference on Computer Vision, pages 208–224. Springer, 2022

  51. [51]

    Randomized quantization is all you need for differential privacy in federated learning, 2023

    Yeojoon Youn, Zihao Hu, Juba Ziani, and Jacob Abernethy. Randomized quantization is all you need for differential privacy in federated learning, 2023. URL https://arxiv.org/abs/2306.11913

  52. [52]

    Opacus: User-friendly differential privacy library in pytorch

    Ashkan Yousefpour, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen, Sayan Ghosh, Akash Bharadwaj, Jessica Zhao, Graham Cormode, and Ilya Mironov. Opacus: User-friendly differential privacy library in pytorch, 2022. URL https://arxiv.org/abs/2109.12298

  53. [53]

    On compressing deep models by low rank and sparse decomposition

    Xiyu Yu, Tongliang Liu, Xinchao Wang, and Dacheng Tao. On compressing deep models by low rank and sparse decomposition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7370–7379, 2017

  54. [54]

    DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients

    Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients, 2018. URL https://arxiv.org/abs/1606.06160

A Appendix / supplemental material

A.1 Training Hyperparameters

While the learning rate might seem too high for regular SGD training, previous res...