DPQuant: Efficient and Differentially-Private Model Training via Dynamic Quantization Scheduling

Gennady Pekhimenko; Nandita Vijaykumar; Renbo Tu; Yubo Gao

arXiv: 2509.03472 · v2 · submitted 2025-09-03 · 💻 cs.LG · cs.AI · cs.DC

Yubo Gao, Renbo Tu, Gennady Pekhimenko, Nandita Vijaykumar

Pith reviewed 2026-05-18 19:04 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.DC

keywords differential privacy · quantization · DP-SGD · dynamic scheduling · model efficiency · privacy-preserving ML · low-precision training

The pith

Dynamic quantization scheduling reduces accuracy loss from noise amplification in differentially private training while delivering up to 2.21 times higher theoretical throughput on low-precision hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Quantization reduces training compute but causes larger accuracy drops under differential privacy because added noise amplifies quantization variance. DPQuant counters this by rotating which layers get quantized each epoch through probabilistic sampling and by ranking layers with a loss sensitivity estimator that itself satisfies differential privacy. The estimator uses only a negligible fraction of the total privacy budget, so the overall guarantee is preserved. Experiments on ResNet and DenseNet models show the approach reaches near Pareto-optimal accuracy versus compute points and maintains less than 2 percent validation accuracy loss while extending to DP-Adam.
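To make the scheduling idea concrete, here is a minimal Python sketch of per-epoch layer selection. It is not the authors' implementation. It assumes every layer already has a loss-impact score (in DPQuant this would come from the differentially private estimator) and samples a fixed number of layers without replacement, favoring layers with low impact. The function name `sample_quantized_layers`, the exponential weighting, the `temperature` parameter, and the example scores are illustrative assumptions that do not come from the paper.

```python
import math
import random


def sample_quantized_layers(impact, budget, temperature=1.0, rng=None):
    """Sample `budget` distinct layers to quantize for one epoch.

    impact: dict mapping layer name -> estimated loss impact (lower is safer).
    Layers are drawn without replacement with weight exp(-impact / temperature),
    so low-impact layers are preferred but every layer can still be picked.
    This weighting is an assumption made for illustration only.
    """
    if rng is None:
        rng = random.Random()
    remaining = dict(impact)
    chosen = []
    for _ in range(min(budget, len(remaining))):
        names = list(remaining)
        weights = [math.exp(-remaining[n] / temperature) for n in names]
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]
    return chosen


# Hypothetical loss-impact scores for a five-layer model.
impact = {"conv1": 0.9, "layer1": 0.1, "layer2": 0.2, "layer3": 0.3, "fc": 1.5}
rng = random.Random(0)
schedule = [sample_quantized_layers(impact, budget=3, rng=rng) for _ in range(5)]
for epoch, layers in enumerate(schedule):
    print(epoch, sorted(layers))
```

Because the draw is repeated every epoch, the quantized set rotates over time rather than fixing the same layers for the whole run, which is the property the summary above attributes to probabilistic sampling.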

Core claim

Quantization variance grows disproportionately under the noise injection of DP-SGD and DP-Adam; this degradation is reduced by a dynamic schedule that probabilistically rotates the set of quantized layers every epoch and prioritizes quantization decisions via a differentially private loss sensitivity estimator that consumes negligible privacy budget.

What carries the argument

DPQuant dynamic quantization scheduler that combines probabilistic layer rotation across epochs with a differentially private loss sensitivity estimator for layer prioritization.

If this is right

DPQuant outperforms static quantization baselines on accuracy-compute trade-offs for ResNet18, ResNet50, and DenseNet121.
Theoretical throughput on low-precision hardware improves by up to 2.21 times.
Validation accuracy remains within 2 percent of full-precision DP training.
The same scheduling gains appear when the method is applied to DP-Adam.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same rotation-plus-private-ranking pattern could be applied to other noise-injected optimizers beyond DP-SGD and DP-Adam.
Hardware measurements on actual low-precision accelerators would be needed to confirm the claimed throughput numbers translate to wall-clock savings.
Combining the scheduler with complementary techniques such as gradient compression could produce further efficiency gains under fixed privacy budgets.

Load-bearing premise

The differentially private loss sensitivity estimator can reliably identify which layers can be quantized with little quality impact while using only a negligible fraction of the overall privacy budget.
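The premise can be illustrated with a generic Gaussian mechanism applied to per-example loss differences. The sketch below is an assumption-laden illustration, not the paper's Algorithm 1, whose details are not reproduced on this page. The function name `private_loss_impact`, the clipping bound `clip`, the `noise_multiplier` parameter, and the sample losses are all hypothetical. Translating the noise level into an (ε, δ) cost requires a privacy accountant, which is deliberately left out.

```python
import random


def private_loss_impact(loss_fp, loss_quant, clip=1.0, noise_multiplier=1.0, rng=None):
    """Noisy mean loss increase caused by quantizing one layer.

    loss_fp, loss_quant: per-example losses on the same batch, computed with the
    layer in full precision and quantized respectively.
    Each per-example difference is clipped to [-clip, clip], so adding or
    removing one example changes the sum by at most `clip`. Gaussian noise with
    standard deviation noise_multiplier * clip is then added to the sum.
    The batch size is treated as public here, which is an assumption.
    """
    if rng is None:
        rng = random.Random()
    diffs = [max(-clip, min(clip, q - f)) for f, q in zip(loss_fp, loss_quant)]
    noisy_sum = sum(diffs) + rng.gauss(0.0, noise_multiplier * clip)
    return noisy_sum / len(diffs)


# Hypothetical per-example losses for a batch of four examples.
loss_fp = [0.9, 1.1, 0.7, 1.3]
loss_quant = [1.0, 1.6, 0.7, 3.5]
print(private_loss_impact(loss_fp, loss_quant, rng=random.Random(0)))
```

With small batches the noise term dominates the clipped signal, which is exactly the reliability concern the premise raises: the estimate is only useful for ranking when the per-layer differences are large relative to the injected noise.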

What would settle it

An ablation that disables the loss sensitivity estimator or increases its privacy allocation, after which accuracy-compute curves fall back to the levels of static quantization baselines.

Figures

Figures reproduced from arXiv: 2509.03472 by Gennady Pekhimenko, Nandita Vijaykumar, Renbo Tu, Yubo Gao.

**Figure 2.** DPQuant system overview.

Accompanying text: Suppose a layer is quantized with probability $p$. Let $g_{\mathrm{fp}}$ denote its full-precision gradients and $g_{\mathrm{quant}}$ its gradients computed under quantization. By Section 4, quantization incurs additional variance, hence $\mathrm{Var}(g_{\mathrm{fp}}) \le \mathrm{Var}(g_{\mathrm{quant}})$. The expected gradient variance can then be written as $\mathbb{E}[\mathrm{Var}(g)] = (1-p)\,\mathrm{Var}(g_{\mathrm{fp}}) + p\,\mathrm{Var}(g_{\mathrm{quant}}) \le \mathrm{Var}(g_{\mathrm{quant}})$. From this it follows that whenever p …
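The variance relation quoted with Figure 2 can be checked numerically. The sketch below evaluates the mixture formula for a few values of p and compares it with a small Monte Carlo simulation. The simulation assumes zero-mean Gaussian gradient noise with the same mean in both modes, and the variances 1.0 and 4.0 are hypothetical. Neither assumption comes from the paper.

```python
import random


def expected_variance(p, var_fp, var_quant):
    """E[Var(g)] when a layer is quantized with probability p."""
    return (1.0 - p) * var_fp + p * var_quant


def simulated_variance(p, var_fp, var_quant, trials=50_000, seed=0):
    """Monte Carlo estimate assuming zero-mean Gaussian gradient noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Flip the per-epoch coin, then draw a gradient from the chosen mode.
        var = var_quant if rng.random() < p else var_fp
        total += rng.gauss(0.0, var ** 0.5) ** 2
    return total / trials


var_fp, var_quant = 1.0, 4.0  # hypothetical variances with var_fp <= var_quant
for p in (0.0, 0.25, 0.5, 1.0):
    analytic = expected_variance(p, var_fp, var_quant)
    # The mixture always lies between the two endpoint variances.
    assert var_fp <= analytic <= var_quant
    print(p, analytic, round(simulated_variance(p, var_fp, var_quant), 2))
```

The bound is tight only at p = 1, so any rotation schedule that leaves a layer in full precision for part of training lowers its expected gradient variance relative to always quantizing it.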

**Figure 3.** Privacy cost of analysis for ResNet18/GTSRB, performing analysis every 2 epochs.

**Figure 4.** Comparing policies generated by DPQuant to the speed-accuracy Pareto front.

Accompanying text: … a certain number of layers are quantized. We refer to the desired number of quantized layers as “computational budget” because it determines the speed and compute resources needed. In …

**Figure 5.** Ablation study. PLS: probabilistic layer selection; LLP: loss-aware layer prioritization.

Accompanying text: In order to better understand the contributions of the two approaches, we compared our approach (probabilistic layer sampling + loss-aware layer prioritization) with probabilistic layer sampling (PLS) alone. In …

**Figure 6.** Theoretical speedups for DPQuant assuming 90% of the layers are quantized.

Accompanying text: As hardware with support for FP4 MatMuls and Conv2D (e.g., NVIDIA Blackwell) is not yet widely available, we are unable to evaluate the speed benefits of quantization with DPQuant. Instead, we use estimates from prior work, along with performance statistics published by NVIDIA [37], to estimate speedups. We estimate that FP4 can …
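One common way to turn a per-kernel speedup into an end-to-end estimate is Amdahl's law. The sketch below only illustrates that general formula. The paper's own estimation procedure is truncated in the Figure 6 text and may differ. The function name `amdahl_speedup` and every numeric input are hypothetical and are not taken from the paper or from NVIDIA's published figures.

```python
def amdahl_speedup(accelerated_fraction, kernel_speedup):
    """End-to-end speedup when only part of the runtime gets faster.

    accelerated_fraction: share of baseline runtime spent in kernels that run
        in low precision (between 0 and 1).
    kernel_speedup: how much faster those kernels run in low precision.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup)


# Hypothetical inputs: vary the share of runtime that benefits from low precision.
for fraction in (0.3, 0.6, 0.9):
    print(fraction, round(amdahl_speedup(fraction, kernel_speedup=4.0), 2))
```

The formula makes one caveat visible: the fraction of layers quantized is not the same as the fraction of runtime accelerated, since steps such as per-sample gradient clipping and noise addition stay in full precision.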

**Figure 7.** Quantization simulation setup. (Footnote 2 on the same page: this assumption is stated in https://github.com/pytorch/opacus/blob/main/opacus/accountants/analysis/rdp.py)

**Figure 8.** Runtime decomposition of DP-SGD training.

Original abstract

Differentially-Private SGD (DP-SGD) and its adaptive variant DP-Adam are powerful techniques to protect user privacy when using sensitive data to train neural networks. During training, converting model weights and activations into low-precision formats, i.e., quantization, can drastically reduce training times, energy consumption, and cost, and is thus a widely used technique. In this work, we demonstrate for the first time that quantization causes significantly higher accuracy degradation in DP training compared to regular SGD. We observe that this is caused by noise injection, which amplifies quantization variance, leading to disproportionately large accuracy degradation. To address this challenge, we present DPQuant, a dynamic quantization framework that adaptively selects a changing subset of layers to quantize at each epoch. Our method combines two key ideas that effectively reduce quantization variance: (i) probabilistic sampling that rotates which layers are quantized every epoch, and (ii) loss-aware layer prioritization, which uses a differentially private loss sensitivity estimator to identify layers that can be quantized with minimal impact on model quality. This estimator consumes a negligible fraction of the overall privacy budget, preserving DP guarantees. Empirical evaluations on ResNet18, ResNet50, and DenseNet121 across a range of datasets demonstrate that DPQuant consistently outperforms static quantization baselines, achieving near Pareto-optimal accuracy-compute trade-offs and up to $2.21\times$ theoretical throughput improvements on low-precision hardware, with less than 2% drop in validation accuracy. We further show that our framework extends to DP-Adam with similar gains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DPQuant shows quantization hurts DP training more due to noise amplification and offers a dynamic fix via rotation plus DP loss sensitivity prioritization, but the estimator's reliability under its own noise is the untested hinge.

The key thing to know is that DP training takes a bigger hit from quantization than regular training does, because the privacy noise magnifies the errors from low-precision weights and activations. DPQuant addresses this by rotating which layers get quantized each epoch in a probabilistic way and by using a loss sensitivity estimator that runs under DP to pick the least damaging layers. This combination is presented as new, and the empirical results back it up reasonably well. They test on ResNet18, ResNet50, and DenseNet121 with various datasets, reporting better accuracy versus compute curves than static quantization and speedups up to 2.21 times on low-precision hardware, all with under 2% accuracy drop. The fact that it also works for DP-Adam is a nice extension. The approach is algorithmic and doesn't rely on fitting parameters in a circular way.

The soft spot is the loss sensitivity estimator. It has to rank layers accurately even though it adds its own DP noise, and it must use only a small slice of the total privacy budget. If the noise makes the rankings inconsistent, the prioritization won't deliver the claimed gains over simpler rotation. The abstract doesn't spell out the exact sensitivity calculation or show how stable the rankings are, and the reported results don't include error bars or detailed ablations on this component. That leaves some uncertainty about whether the improvements are robust.

This paper is for researchers focused on making differentially private training practical on hardware with limited precision support. Readers who need concrete ways to trade off privacy, accuracy, and speed in ML training will find usable ideas here. It has enough novelty and experimental grounding to merit a serious referee. I would recommend putting it through peer review, with specific requests for more analysis on the estimator's noise tolerance and additional statistical details on the results.

Referee Report

2 major / 2 minor

Summary. The paper claims that quantization induces significantly higher accuracy degradation under DP-SGD than under standard SGD because injected noise amplifies quantization variance. It proposes DPQuant, which dynamically selects a rotating subset of layers for quantization each epoch via (i) probabilistic sampling and (ii) a loss-aware prioritization that employs a differentially private loss sensitivity estimator. The estimator is asserted to consume only a negligible fraction of the total privacy budget. Experiments on ResNet18, ResNet50 and DenseNet121 across multiple datasets report that DPQuant outperforms static quantization baselines, reaches near-Pareto-optimal accuracy-compute trade-offs, delivers up to 2.21× theoretical throughput gains on low-precision hardware, and incurs less than 2 % validation-accuracy drop; similar gains are shown for DP-Adam.

Significance. If the empirical claims and the negligible-budget property of the estimator hold, the work would be a useful practical contribution: it directly tackles the under-studied interaction between DP noise and quantization error and supplies a concrete scheduling mechanism that improves efficiency without materially harming privacy or accuracy. The reported throughput numbers and extension to DP-Adam strengthen the case for deployment on quantized hardware.

major comments (2)

[Abstract / §4 (method)] Abstract and the description of the loss sensitivity estimator: the central claim that the estimator “consumes a negligible fraction of the overall privacy budget” and still produces reliable layer rankings is load-bearing for both the DP guarantee and the reported accuracy gains. No concrete budget split (e.g., ε_estimator / ε_total), no formula for sensitivity computation, and no stability analysis under the added DP noise are supplied; if the estimator’s own noise corrupts the ranking, the dynamic schedule may not outperform static baselines.
[Experimental evaluation] Empirical section: the headline results (<2 % accuracy drop, 2.21× throughput, Pareto optimality) are presented without error bars, exact (ε,δ) values, or ablations that isolate the contribution of the DP estimator versus the probabilistic rotation alone. This leaves open whether the observed gains are statistically robust or dataset/architecture-specific.

minor comments (2)

[Method] Clarify the precise definition and implementation of the probabilistic rotation schedule (e.g., sampling probability per layer, epoch-wise reselection rule) so that the method is reproducible from the text alone.
[Figures and tables] Add error bars or confidence intervals to all accuracy and throughput plots; without them the “near Pareto-optimal” claim is difficult to evaluate.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the practical significance of addressing the interaction between DP noise and quantization. We address each major comment below and commit to revisions that strengthen the clarity and robustness of the claims without altering the core contributions.

Referee: [Abstract / §4 (method)] Abstract and the description of the loss sensitivity estimator: the central claim that the estimator “consumes a negligible fraction of the overall privacy budget” and still produces reliable layer rankings is load-bearing for both the DP guarantee and the reported accuracy gains. No concrete budget split (e.g., ε_estimator / ε_total), no formula for sensitivity computation, and no stability analysis under the added DP noise are supplied; if the estimator’s own noise corrupts the ranking, the dynamic schedule may not outperform static baselines.

Authors: We agree that the current presentation of the loss sensitivity estimator in §4 is high-level and that concrete details are needed to fully substantiate the negligible-budget claim and the reliability of the resulting layer rankings. The manuscript describes the estimator as a DP mechanism with bounded sensitivity, but does not provide an explicit numerical budget allocation, the precise sensitivity formula, or a stability analysis. In the revised manuscript we will add a dedicated paragraph in §4 that (i) states the exact budget split used (e.g., ε_estimator ≤ 0.05 ε_total), (ii) gives the closed-form sensitivity expression, and (iii) reports an empirical stability study comparing private versus non-private layer rankings across the evaluated architectures. These additions will directly address the concern that estimator noise could degrade the dynamic schedule. revision: yes
Referee: [Experimental evaluation] Empirical section: the headline results (<2 % accuracy drop, 2.21× throughput, Pareto optimality) are presented without error bars, exact (ε,δ) values, or ablations that isolate the contribution of the DP estimator versus the probabilistic rotation alone. This leaves open whether the observed gains are statistically robust or dataset/architecture-specific.

Authors: We concur that the experimental section would benefit from greater statistical transparency and component-wise ablations. The current results report mean accuracy and throughput but omit standard deviations, do not list the precise (ε, δ) tuples for every configuration, and do not isolate the loss-aware prioritization from the probabilistic rotation. In the revision we will (i) add error bars (mean ± std) to all accuracy and throughput figures, (ii) explicitly tabulate the (ε, δ) values used for each dataset–architecture pair, and (iii) include a new ablation table that compares full DPQuant against a variant that uses only probabilistic rotation (i.e., without the DP loss-sensitivity estimator). These changes will demonstrate that both mechanisms contribute to the reported gains and that the improvements are statistically consistent across the evaluated settings. revision: yes
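The stability study promised in the first response, comparing private with non-private layer rankings, could be prototyped along the following lines. This is a hypothetical sketch, not the authors' protocol. It perturbs a set of made-up per-layer scores with Gaussian noise and reports the average Spearman rank correlation with the noise-free ranking. The helper names, the score values, and the noise levels are all assumptions, and the rank helper does not handle ties.

```python
import random


def ranks(values):
    """Rank positions (0 = smallest). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(a, b):
    """Spearman rank correlation for tie-free sequences of equal length."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def ranking_stability(true_scores, noise_std, trials=200, seed=0):
    """Mean rank correlation between clean scores and noisy copies of them."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        noisy = [s + rng.gauss(0.0, noise_std) for s in true_scores]
        total += spearman(true_scores, noisy)
    return total / trials


# Hypothetical per-layer loss impacts for a 20-layer model.
true_scores = [0.05 * i for i in range(20)]
for noise_std in (0.0, 0.05, 0.5):
    print(noise_std, round(ranking_stability(true_scores, noise_std), 3))
```

A correlation that stays close to 1 at the noise level implied by the chosen budget split would support the claim that the estimator's rankings are reliable. A sharp drop would suggest the prioritization adds little beyond probabilistic rotation.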

Circularity Check

0 steps flagged

No circularity: DPQuant is an empirical algorithmic proposal with independent evaluations

The paper proposes DPQuant as a practical dynamic quantization scheduler for DP-SGD that combines probabilistic layer rotation with a loss-aware prioritization step driven by a DP sensitivity estimator. All performance claims (accuracy, throughput, Pareto trade-offs) are backed by direct empirical measurements on ResNet18/50 and DenseNet121 across datasets, rather than any closed-form derivation or prediction that reduces to the method's own fitted quantities. The statement that the estimator consumes a negligible privacy budget fraction is presented as an implementation choice without any equation that re-derives or re-uses the same sensitivity scores to justify itself. No self-citation chain is load-bearing for the core contribution, and the work remains falsifiable by external replication on the same models and privacy parameters.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the domain assumption that quantization variance is disproportionately amplified by DP noise and on the practical claim that a small privacy budget suffices for the sensitivity estimator; no new mathematical entities or free parameters are introduced beyond standard DP-SGD hyperparameters.

axioms (1)

domain assumption: Quantization variance is amplified by the noise injection of DP-SGD, causing larger accuracy degradation than in non-private training.
Stated as the root cause that motivates the dynamic scheduler.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "probabilistic sampling that rotates which layers are quantized every epoch, and (ii) loss-aware layer prioritization, which uses a differentially private loss sensitivity estimator"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: "Algorithm 1 COMPUTE LOSS IMPACT … Sampled Gaussian Mechanism … UPDATE PRIVACY"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 12 internal anchors

[1] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS’16. ACM, October 2016. doi: 10.1145/2976749.2978318. URL http://dx.doi.org/10.1145/2976749.2978318
[2] AmirAli Abdolrashidi, Lisa Wang, Shivani Agrawal, Jonathan Malmaud, Oleg Rybakov, Chas Leichner, and Lukasz Lew. Pareto-optimal quantized resnet is mostly 4-bit. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), page 3085–3093. IEEE, June 2021. doi: 10.1109/cvprw53098.2021.00345. URL http://dx.doi.org/10.1109/CVPRW...
[3] Advanced Micro Devices, Inc. AMD Instinct™ MI300X Accelerator Data Sheet: Leading-Edge Accelerator Module for Generative AI, Training, and High-Performance Computing. Technical report, Advanced Micro Devices, Inc., 2023. URL https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/data-sheets/amd-instinct-mi300x-data-sheet.pdf. Accessed: 2025-05-13
[4] Advanced Micro Devices, Inc. Data types and precision support. https://rocm.docs.amd.com/en/latest/reference/precision-support.html, March 2025. ROCm Documentation; Accessed: 2025-05-13
[5] Mohammadreza Alimohammadi, Ilia Markov, Elias Frantar, and Dan Alistarh. L-greco: Layerwise-adaptive gradient compression for efficient and accurate deep learning, 2023. URL https://arxiv.org/abs/2210.17357
[6] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. Qsgd: Communication-efficient sgd via gradient quantization and encoding, 2017. URL https://arxiv.org/abs/1610.02132
[7] Ron Banner, Yury Nahshan, Elad Hoffer, and Daniel Soudry. Post-training 4-bit quantization of convolution networks for rapid-deployment, 2019. URL https://arxiv.org/abs/1810.05723
[8] Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, and Ping Luo. Efficientqat: Efficient quantization-aware training for large language models. arXiv preprint arXiv:2407.11062, 2024
[9] Brian Chmiel, Ron Banner, Elad Hoffer, Hilla Ben Yaacov, and Daniel Soudry. Accurate neural training with 4-bit matrix multiplications at standard formats, 2024. URL https://arxiv.org/abs/2112.10769
[10] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. Pact: Parameterized clipping activation for quantized neural networks, 2018. URL https://arxiv.org/abs/1805.06085
[11] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik. Emnist: an extension of mnist to handwritten letters, 2017. URL https://arxiv.org/abs/1702.05373
[12] Soham De, Leonard Berrada, Jamie Hayes, Samuel L. Smith, and Borja Balle. Unlocking high-accuracy differentially private image classification through scale, 2022. URL https://arxiv.org/abs/2204.13650
[13] Xin Ding, Xiaoyu Liu, Zhijun Tu, Yun Zhang, Wei Li, Jie Hu, Hanting Chen, Yehui Tang, Zhiwei Xiong, Baoqun Yin, et al. Cbq: Cross-block quantization for large language models. arXiv preprint arXiv:2312.07950, 2023
[14] Zhen Dong, Zhewei Yao, Amir Gholami, Michael Mahoney, and Kurt Keutzer. Hawq: Hessian aware quantization of neural networks with mixed-precision, 2019. URL https://arxiv.org/abs/1905.03696
[15] Jian Du, Song Li, Xiangyi Chen, Siheng Chen, and Mingyi Hong. Dynamic differential-privacy preserving sgd, 2022. URL https://arxiv.org/abs/2111.00173
[16] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014. doi: 10.1561/0400000042. URL https://www.nowpublishers.com/article/Details/TCS-042
[17] Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S. Modha. Learned step size quantization, 2020. URL https://arxiv.org/abs/1902.08153
[18] Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient neural network inference, 2021. URL https://arxiv.org/abs/2103.13630
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015. URL https://arxiv.org/abs/1512.03385
[20] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks, 2018. URL https://arxiv.org/abs/1608.06993
[21] Itay Hubara, Yury Nahshan, Yair Hanani, Ron Banner, and Daniel Soudry. Accurate post training quantization with small calibration sets. In International Conference on Machine Learning, pages 4466–4475. PMLR, 2021
[22] Yerlan Idelbayev and Miguel A Carreira-Perpinán. Low-rank compression of neural nets: Learning the rank of each layer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8049–8059, 2020
[23] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference, 2017. URL https://arxiv.org/abs/1712.05877
[24] Matthew Jagielski, Jonathan Ullman, and Alina Oprea. Auditing differentially private machine learning: How private is private sgd?, 2020. URL https://arxiv.org/abs/2006.07709
[25] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper_files/paper/2013/file/ac1dd209cb...
[26] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. Error feedback fixes signsgd and other gradient compression schemes. In International Conference on Machine Learning, pages 3252–3261. PMLR, 2019
[27] Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018
[28] Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, April 2009. URL https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
[29] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J. Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training, 2020. URL https://arxiv.org/abs/1712.01887
[30] TorchVision maintainers and contributors. Torchvision: Pytorch’s computer vision library. https://github.com/pytorch/vision, 2016
[31] Mehdi Makni, Kayhan Behdin, Gabriel Afriat, Zheng Xu, Sergei Vassilvitskii, Natalia Ponomareva, Hussein Hazimeh, and Rahul Mazumder. An optimization framework for differentially private sparse fine-tuning, 2025. URL https://arxiv.org/abs/2503.12822
[32] Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, Naveen Mellempudi, Stuart Oberman, Mohammad Shoeybi, Michael Siu, and Hao Wu. Fp8 formats for deep learning, 2022. URL https://arxiv.org/abs/2209.05433
[33] Ilya Mironov, Kunal Talwar, and Li Zhang. Rényi differential privacy of the sampled gaussian mechanism, 2019. URL https://arxiv.org/abs/1908.10530
[34] Felix Morsbach, Jan Reubold, and Thorsten Strufe. R+r: understanding hyperparameter effects in dp-sgd, 2024. URL https://arxiv.org/abs/2411.02051
[35] Markus Nagel, Mart van Baalen, Tijmen Blankevoort, and Max Welling. Data-free quantization through weight equalization and bias correction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1325–1334, 2019
[36] Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart Van Baalen, and Tijmen Blankevoort. A white paper on neural network quantization. arXiv preprint arXiv:2106.08295, 2021
[37] NVIDIA Corporation. Nvidia blackwell architecture technical overview, 2024. URL https://resources.nvidia.com/en-us-blackwell-architecture. Accessed: 2025-05-05
[38] Eunhyeok Park, Sungjoo Yoo, and Peter Vajda. Value-aware quantization for training and inference of neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 580–595, 2018
[39] Natalia Ponomareva, Hussein Hazimeh, Alex Kurakin, Zheng Xu, Carson Denison, H. Brendan McMahan, Sergei Vassilvitskii, Steve Chien, and Abhradeep Guha Thakurta. How to dp-fy ml: A practical guide to machine learning with differential privacy. Journal of Artificial Intelligence Research, 77:1113–1201, July 2023. ISSN 1076-9757. doi: 10.1613/jair.1.14649. U...
[40] Qualcomm Technologies, Inc. Ai hardware cores/accelerators, 2024. URL https://docs.qualcomm.com/bundle/publicresource/topics/80-63195-1/AI-hardware-cores-accelerators.html. Accessed: 2025-05-05
[41] Charbel Sakr, Steve Dai, Rangha Venkatesan, Brian Zimmer, William Dally, and Brucek Khailany. Optimal clipping and magnitude-aware differentiation for improved quantization-aware training. In International Conference on Machine Learning, pages 19123–19138. PMLR, 2022
[42] Shaohuai Shi, Xianhao Zhou, Shutao Song, Xingyao Wang, Zilin Zhu, Xue Huang, Xinan Jiang, Feihu Zhou, Zhenyu Guo, Liqiang Xie, Rui Lan, Xianbin Ouyang, Yan Zhang, Jieqian Wei, Jing Gong, Weiliang Lin, Ping Gao, Peng Meng, Xiaomin Xu, Chenyang Guo, Bo Yang, Zhibo Chen, Yongjian Wu, and Xiaowen Chu. Towards scalable distributed training of deep learning on ...
[43] Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. The german traffic sign recognition benchmark: A multi-class classification competition. In The 2011 International Joint Conference on Neural Networks, pages 1453–1460, 2011. doi: 10.1109/IJCNN.2011.6033395
[44] Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified sgd with memory. URL https://arxiv.org/abs/1809.07599
[46] Xiao Sun, Naigang Wang, Chia-Yu Chen, Jiamin Ni, Ankur Agrawal, Xiaodong Cui, Swagath Venkataramani, Kaoutar El Maghraoui, Vijayalakshmi (Viji) Srinivasan, and Kailash Gopalakrishnan. Ultra-low precision 4-bit training of deep neural networks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Proces...
[47] Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. Powersgd: Practical low-rank gradient compression for distributed optimization. Advances in Neural Information Processing Systems, 32, 2019
[48] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. Haq: Hardware-aware automated quantization with mixed precision, 2019. URL https://arxiv.org/abs/1811.08886

work page internal anchor Pith review Pith/arXiv arXiv 2019
[49]

TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning

Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning, 2017. URL https: //arxiv.org/abs/1705.07878

work page internal anchor Pith review Pith/arXiv arXiv 2017
[50]

Bitwidth-adaptive quantization-aware neural network training: a meta-learning approach

Jiseok Youn, Jaehun Song, Hyung-Sin Kim, and Saewoong Bahk. Bitwidth-adaptive quantization-aware neural network training: a meta-learning approach. In European Con- ference on Computer Vision, pages 208–224. Springer, 2022

work page 2022
[51]

Randomized quantization is all you need for differential privacy in federated learning, 2023

Yeojoon Youn, Zihao Hu, Juba Ziani, and Jacob Abernethy. Randomized quantization is all you need for differential privacy in federated learning, 2023. URL https://arxiv.org/abs/ 2306.11913

work page arXiv 2023
[52]

Opacus: User-friendly differential privacy library in pytorch.arXiv preprint arXiv:2109.12298, 2021

Ashkan Yousefpour, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen, Sayan Ghosh, Akash Bharadwaj, Jessica Zhao, Graham Cormode, and Ilya Mironov. Opacus: User-friendly differential privacy library in pytorch, 2022. URL https://arxiv.org/abs/2109.12298

work page arXiv 2022
[53]

On compressing deep models by low rank and sparse decomposition

Xiyu Yu, Tongliang Liu, Xinchao Wang, and Dacheng Tao. On compressing deep models by low rank and sparse decomposition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7370–7379, 2017

work page 2017
[54]

DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients

Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients, 2018. URL https://arxiv.org/abs/1606.06160. 14 A Appendix / supplemental material A.1 Training Hyperparameters While the learning rate might seem too high for regular SGD training, previous res...

work page internal anchor Pith review Pith/arXiv arXiv 2018

[1] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS’16. ACM, October 2016. doi: 10.1145/2976749.2978318. URL http://dx.doi.org/10.1145/2976749.2978318

[2] AmirAli Abdolrashidi, Lisa Wang, Shivani Agrawal, Jonathan Malmaud, Oleg Rybakov, Chas Leichner, and Lukasz Lew. Pareto-optimal quantized resnet is mostly 4-bit. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 3085–3093. IEEE, June 2021. doi: 10.1109/cvprw53098.2021.00345. URL http://dx.doi.org/10. 1109/CVPRW...

[3] Advanced Micro Devices, Inc. AMD Instinct™ MI300X Accelerator Data Sheet: Leading-Edge Accelerator Module for Generative AI, Training, and High-Performance Computing. Technical report, Advanced Micro Devices, Inc., 2023. URL https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/data-sheets/amd-instinct-mi300x-data-sheet.pdf. Accessed: 2025-05-13

[4] Advanced Micro Devices, Inc. Data types and precision support. https://rocm.docs.amd.com/en/latest/reference/precision-support.html, March 2025. ROCm Documentation; Accessed: 2025-05-13

[5] Mohammadreza Alimohammadi, Ilia Markov, Elias Frantar, and Dan Alistarh. L-greco: Layerwise-adaptive gradient compression for efficient and accurate deep learning, 2023. URL https://arxiv.org/abs/2210.17357

[6] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. Qsgd: Communication-efficient sgd via gradient quantization and encoding, 2017. URL https://arxiv.org/abs/1610.02132

[7] Ron Banner, Yury Nahshan, Elad Hoffer, and Daniel Soudry. Post-training 4-bit quantization of convolution networks for rapid-deployment, 2019. URL https://arxiv.org/abs/1810.05723

[8] Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, and Ping Luo. Efficientqat: Efficient quantization-aware training for large language models. arXiv preprint arXiv:2407.11062, 2024

[9] Brian Chmiel, Ron Banner, Elad Hoffer, Hilla Ben Yaacov, and Daniel Soudry. Accurate neural training with 4-bit matrix multiplications at standard formats, 2024. URL https://arxiv.org/abs/2112.10769

[10] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. Pact: Parameterized clipping activation for quantized neural networks, 2018. URL https://arxiv.org/abs/1805.06085

[11] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik. Emnist: an extension of mnist to handwritten letters, 2017. URL https://arxiv.org/abs/1702.05373

[12] Soham De, Leonard Berrada, Jamie Hayes, Samuel L. Smith, and Borja Balle. Unlocking high-accuracy differentially private image classification through scale, 2022. URL https://arxiv.org/abs/2204.13650

[13] Xin Ding, Xiaoyu Liu, Zhijun Tu, Yun Zhang, Wei Li, Jie Hu, Hanting Chen, Yehui Tang, Zhiwei Xiong, Baoqun Yin, et al. Cbq: Cross-block quantization for large language models. arXiv preprint arXiv:2312.07950, 2023

[14] Zhen Dong, Zhewei Yao, Amir Gholami, Michael Mahoney, and Kurt Keutzer. Hawq: Hessian aware quantization of neural networks with mixed-precision, 2019. URL https://arxiv.org/abs/1905.03696

[15] Jian Du, Song Li, Xiangyi Chen, Siheng Chen, and Mingyi Hong. Dynamic differential-privacy preserving sgd, 2022. URL https://arxiv.org/abs/2111.00173

[16] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014. doi: 10.1561/0400000042. URL https://www.nowpublishers.com/article/Details/TCS-042

[17] Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S. Modha. Learned step size quantization, 2020. URL https://arxiv.org/abs/1902.08153

[18] Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient neural network inference, 2021. URL https://arxiv.org/abs/2103.13630

[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015. URL https://arxiv.org/abs/1512.03385

[20] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks, 2018. URL https://arxiv.org/abs/1608.06993

[21] Itay Hubara, Yury Nahshan, Yair Hanani, Ron Banner, and Daniel Soudry. Accurate post training quantization with small calibration sets. In International Conference on Machine Learning, pages 4466–4475. PMLR, 2021

[22] Yerlan Idelbayev and Miguel A Carreira-Perpinán. Low-rank compression of neural nets: Learning the rank of each layer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8049–8059, 2020

[23] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference, 2017. URL https://arxiv.org/abs/1712.05877

[24] Matthew Jagielski, Jonathan Ullman, and Alina Oprea. Auditing differentially private machine learning: How private is private sgd?, 2020. URL https://arxiv.org/abs/2006.07709

[25] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper_files/paper/2013/file/ ac1dd209cb...

[26] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. Error feedback fixes signsgd and other gradient compression schemes. In International Conference on Machine Learning, pages 3252–3261. PMLR, 2019

[27] Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018

[28] Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, April 2009. URL https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf

[29] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J. Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training, 2020. URL https://arxiv.org/abs/1712.01887

[30] TorchVision maintainers and contributors. Torchvision: Pytorch’s computer vision library. https://github.com/pytorch/vision, 2016

[31] Mehdi Makni, Kayhan Behdin, Gabriel Afriat, Zheng Xu, Sergei Vassilvitskii, Natalia Ponomareva, Hussein Hazimeh, and Rahul Mazumder. An optimization framework for differentially private sparse fine-tuning, 2025. URL https://arxiv.org/abs/2503.12822

[32] Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, Naveen Mellempudi, Stuart Oberman, Mohammad Shoeybi, Michael Siu, and Hao Wu. Fp8 formats for deep learning, 2022. URL https://arxiv.org/abs/2209.05433

[33] Ilya Mironov, Kunal Talwar, and Li Zhang. Rényi differential privacy of the sampled gaussian mechanism, 2019. URL https://arxiv.org/abs/1908.10530

[34] Felix Morsbach, Jan Reubold, and Thorsten Strufe. R+r: understanding hyperparameter effects in dp-sgd, 2024. URL https://arxiv.org/abs/2411.02051

[35] Markus Nagel, Mart van Baalen, Tijmen Blankevoort, and Max Welling. Data-free quantization through weight equalization and bias correction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1325–1334, 2019

[36] Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart Van Baalen, and Tijmen Blankevoort. A white paper on neural network quantization. arXiv preprint arXiv:2106.08295, 2021

[37] NVIDIA Corporation. Nvidia blackwell architecture technical overview, 2024. URL https://resources.nvidia.com/en-us-blackwell-architecture. Accessed: 2025-05-05

[38] Eunhyeok Park, Sungjoo Yoo, and Peter Vajda. Value-aware quantization for training and inference of neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 580–595, 2018

[39] Natalia Ponomareva, Hussein Hazimeh, Alex Kurakin, Zheng Xu, Carson Denison, H. Brendan McMahan, Sergei Vassilvitskii, Steve Chien, and Abhradeep Guha Thakurta. How to dp-fy ml: A practical guide to machine learning with differential privacy. Journal of Artificial Intelligence Research, 77:1113–1201, July 2023. ISSN 1076-9757. doi: 10.1613/jair.1.14649. U...

[40] Qualcomm Technologies, Inc. Ai hardware cores/accelerators, 2024. URL https://docs.qualcomm.com/bundle/publicresource/topics/80-63195-1/AI-hardware-cores-accelerators.html. Accessed: 2025-05-05

[41] Charbel Sakr, Steve Dai, Rangha Venkatesan, Brian Zimmer, William Dally, and Brucek Khailany. Optimal clipping and magnitude-aware differentiation for improved quantization-aware training. In International Conference on Machine Learning, pages 19123–19138. PMLR, 2022

[42] Shaohuai Shi, Xianhao Zhou, Shutao Song, Xingyao Wang, Zilin Zhu, Xue Huang, Xinan Jiang, Feihu Zhou, Zhenyu Guo, Liqiang Xie, Rui Lan, Xianbin Ouyang, Yan Zhang, Jieqian Wei, Jing Gong, Weiliang Lin, Ping Gao, Peng Meng, Xiaomin Xu, Chenyang Guo, Bo Yang, Zhibo Chen, Yongjian Wu, and Xiaowen Chu. Towards scalable distributed training of deep learning on ...

[43] Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. The german traffic sign recognition benchmark: A multi-class classification competition. In The 2011 International Joint Conference on Neural Networks, pages 1453–1460, 2011. doi: 10.1109/IJCNN.2011.6033395

[44] Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified sgd with memory. URL https://arxiv.org/abs/1809.07599

[46] Xiao Sun, Naigang Wang, Chia-Yu Chen, Jiamin Ni, Ankur Agrawal, Xiaodong Cui, Swagath Venkataramani, Kaoutar El Maghraoui, Vijayalakshmi (Viji) Srinivasan, and Kailash Gopalakrishnan. Ultra-low precision 4-bit training of deep neural networks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Proces...

[47] Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. Powersgd: Practical low-rank gradient compression for distributed optimization. Advances in Neural Information Processing Systems, 32, 2019

[48] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. Haq: Hardware-aware automated quantization with mixed precision, 2019. URL https://arxiv.org/abs/1811.08886

[49] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning, 2017. URL https://arxiv.org/abs/1705.07878

[50] Jiseok Youn, Jaehun Song, Hyung-Sin Kim, and Saewoong Bahk. Bitwidth-adaptive quantization-aware neural network training: a meta-learning approach. In European Conference on Computer Vision, pages 208–224. Springer, 2022

[51] Yeojoon Youn, Zihao Hu, Juba Ziani, and Jacob Abernethy. Randomized quantization is all you need for differential privacy in federated learning, 2023. URL https://arxiv.org/abs/2306.11913

[52] Ashkan Yousefpour, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen, Sayan Ghosh, Akash Bharadwaj, Jessica Zhao, Graham Cormode, and Ilya Mironov. Opacus: User-friendly differential privacy library in pytorch, 2022. URL https://arxiv.org/abs/2109.12298

[53] Xiyu Yu, Tongliang Liu, Xinchao Wang, and Dacheng Tao. On compressing deep models by low rank and sparse decomposition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7370–7379, 2017

[54] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients, 2018. URL https://arxiv.org/abs/1606.06160

A Appendix / supplemental material

A.1 Training Hyperparameters

While the learning rate might seem too high for regular SGD training, previous res...