Mixed-Precision Communication-Avoiding SGD for Generalized Linear Models on GPUs

Aditya Devarakonda; Giulia Guidi; Irene Simó Muñoz

arxiv: 2606.18463 · v1 · pith:AHKZBBHQnew · submitted 2026-06-16 · 💻 cs.DC · cs.LG · cs.NA · math.NA · stat.ML

Pith reviewed 2026-06-26 22:19 UTC · model grok-4.3

keywords: mixed-precision · communication-avoiding SGD · generalized linear models · Gram matrix · finite-precision analysis · AllReduce · stochastic gradient descent · GPUs

The pith

Mixed-precision CA-SGD stores inputs in low precision, accumulates the Gram matrix in high precision, and communicates it in high precision, matching FP32 SGD loss within 0.5% while delivering 5.1-6.8x speedup on A100 GPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a mixed-precision recipe for communication-avoiding SGD on generalized linear models. It stores the input matrix and margin vector in low precision, computes the Gram matrix from those low-precision values with high-precision accumulation, communicates the Gram matrix in high precision, and runs the inner recurrence and weight updates in high precision. A finite-precision analysis decomposes the rounding error of one outer iteration into nine independent choices that depend on hardware only through low-precision unit roundoffs. On NERSC Perlmutter A100 GPUs the resulting method reaches speedups of 5.1 to 6.8 times over FP32 SGD on logistic, linear, and Poisson problems while keeping loss within 0.5 percent.
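The split between low-precision storage and high-precision accumulation can be illustrated with a small sketch. It is not the paper's implementation: it uses only the Python standard library, takes IEEE FP16 as the low-precision stand-in (the standard library has no BF16 type), and emulates FP32 accumulation by rounding the running sum after every addition. The helper names `round_to`, `gram`, and `max_abs_diff`, the 4 x 256 shape, and the random data are all assumptions made for illustration.

```python
import random
import struct

def round_to(fmt, x):
    """Round x to the nearest value in a struct format ('e' = FP16, 'f' = FP32)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def gram(rows, acc_fmt=None):
    """Gram matrix of `rows`; the running sum is rounded to acc_fmt after each add.

    acc_fmt=None keeps Python's native FP64 and serves as the reference.
    """
    n = len(rows)
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            acc = 0.0
            for a, b in zip(rows[i], rows[j]):
                acc = acc + a * b
                if acc_fmt is not None:
                    acc = round_to(acc_fmt, acc)
            G[i][j] = G[j][i] = acc
    return G

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

random.seed(0)
# Hypothetical 4 x 256 block of inputs, stored in low precision (FP16).
Y_low = [[round_to("e", random.uniform(-1.0, 1.0)) for _ in range(256)]
         for _ in range(4)]

G_ref = gram(Y_low)                                  # FP64 reference, same inputs
err_high = max_abs_diff(gram(Y_low, "f"), G_ref)     # FP32 accumulation
err_low = max_abs_diff(gram(Y_low, "e"), G_ref)      # FP16 accumulation
print(f"FP32 accumulation error: {err_high:.2e}")
print(f"FP16 accumulation error: {err_low:.2e}")
```

Both runs read the same FP16 inputs, so any difference between the two errors comes from the accumulator format alone.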

Core claim

The finite-precision analysis decomposes the local rounding error of one CA-SGD outer iteration into nine independent precision choices depending on the hardware only through its low-precision unit roundoffs. The derived recipe stores the input matrix and margin vector in low precision, computes the Gram matrix from low-precision inputs with high-precision accumulation, communicates it in high precision, and performs the inner recurrence and weight updates in high precision. On A100 GPUs this produces loss within 0.5% of FP32 SGD on logistic, linear, and Poisson regression while achieving 5.1-6.8x speedup.

What carries the argument

The finite-precision error analysis that decomposes one CA-SGD outer iteration's rounding error into nine independent precision choices controlled by low-precision unit roundoffs.
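Since the analysis is said to depend on hardware only through unit roundoffs, it may help to see the values involved. The sketch below lists the round-to-nearest unit roundoff u = 2^-p for formats available on A100-class GPUs and evaluates the textbook worst-case constant gamma_n = n*u / (1 - n*u) for a length-n sum. The dictionary, the function names, and the choice n = 128 are illustrative assumptions; the source does not show how the paper's nine terms combine these quantities.

```python
# Significand precision p in bits, counting the implicit leading bit.
PRECISION_BITS = {"BF16": 8, "FP16": 11, "TF32": 11, "FP32": 24}

def unit_roundoff(fmt):
    """Unit roundoff u = 2**-p, assuming round-to-nearest."""
    return 2.0 ** -PRECISION_BITS[fmt]

def gamma(n, u):
    """Worst-case constant n*u / (1 - n*u); infinite once n*u >= 1."""
    if n * u >= 1.0:
        return float("inf")
    return n * u / (1.0 - n * u)

for fmt in PRECISION_BITS:
    u = unit_roundoff(fmt)
    print(f"{fmt}: u = {u:.3e}, gamma_128 = {gamma(128, u):.3e}")
```

For BF16 the worst-case constant already reaches 1 at n = 128 and is vacuous from n = 256, which is one way to see why long accumulations are kept in the wider format.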

If this is right

The same precision recipe applies to logistic, linear, and Poisson generalized linear models.
Loss stays within 0.5% of FP32 SGD while communication is reduced from s AllReduces to one Gram-matrix AllReduce.
Speedups of 5.1-6.8x are observed on epsilon, SUSY, HIGGS, synth, and Poisson-synth datasets.
The nine precision choices depend on hardware only through low-precision unit roundoffs, so the recipes transfer in principle across GPU generations.
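The trade between s small AllReduces and one Gram-matrix AllReduce can be sketched with a simplified latency-bandwidth model. Everything numeric here is hypothetical: the model form alpha*log2(P) + beta*words, the constants `ALPHA` and `BETA`, the rank count, and the per-step message size `WORDS_PER_STEP` are assumptions for illustration and are not taken from the paper. Only the (s*b) x (s*b) Gram size comes from the abstract.

```python
import math

def allreduce_time(words, ranks, alpha, beta):
    """Simplified latency-bandwidth estimate: alpha*log2(P) + beta*words."""
    return alpha * math.log2(ranks) + beta * words

def sgd_comm_time(s, words_per_step, ranks, alpha, beta):
    """s separate AllReduces, one per SGD iteration."""
    return s * allreduce_time(words_per_step, ranks, alpha, beta)

def ca_sgd_comm_time(s, b, ranks, alpha, beta):
    """One AllReduce of the (s*b) x (s*b) Gram matrix per outer iteration."""
    return allreduce_time((s * b) ** 2, ranks, alpha, beta)

# Hypothetical machine and problem parameters, chosen only for illustration.
ALPHA = 5e-6      # seconds of latency per log2(P) stage
BETA = 1.6e-10    # seconds per word moved
P, B, WORDS_PER_STEP = 64, 32, 32

for s in (16, 256):
    t_sgd = sgd_comm_time(s, WORDS_PER_STEP, P, ALPHA, BETA)
    t_ca = ca_sgd_comm_time(s, B, P, ALPHA, BETA)
    print(f"s={s}: SGD {t_sgd:.2e} s, CA-SGD {t_ca:.2e} s")
```

Under these made-up constants the single Gram AllReduce is cheaper at s = 16 and more expensive at s = 256, where the quadratic bandwidth term dominates. The sketch says nothing about where the crossover sits on a real machine.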

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The nine-choice decomposition supplies a template that could be reused to analyze rounding error in other iterative methods built around Gram-matrix AllReduces.
Because the recipe separates low-precision storage from high-precision accumulation and communication, the same structure may reduce bandwidth pressure in distributed training of models beyond generalized linear models.
Practitioners can adopt the listed precision assignments on any NVIDIA GPU supporting the same low-precision formats without re-deriving the error bounds.

Load-bearing premise

The nine rounding-error terms arising from different precision choices remain sufficiently independent that their aggregate effect on the final loss stays bounded without hardware-specific interactions beyond the unit roundoffs.

What would settle it

Measure the loss of the mixed-precision CA-SGD recipe on the same logistic, linear, or Poisson problems on a newer GPU generation; deviation greater than 0.5% from FP32 SGD would falsify transferability of the nine-choice decomposition.
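The check proposed above reduces to a one-line comparison. The sketch below assumes that "within 0.5%" means the relative difference between final loss values; the source does not state the exact definition, and the function names and sample numbers are hypothetical.

```python
def relative_loss_gap(loss_mixed, loss_fp32):
    """Relative deviation of the mixed-precision loss from the FP32 baseline."""
    return abs(loss_mixed - loss_fp32) / abs(loss_fp32)

def within_tolerance(loss_mixed, loss_fp32, tol=0.005):
    """True when the gap is at most tol (0.005 corresponds to 0.5%)."""
    return relative_loss_gap(loss_mixed, loss_fp32) <= tol

# Hypothetical final losses, not measurements from the paper.
print(within_tolerance(0.3512, 0.3500))
print(within_tolerance(0.3530, 0.3500))
```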

Figures

Figures reproduced from arXiv: 2606.18463 by Aditya Devarakonda, Giulia Guidi, Irene Simó Muñoz.

**Figure 2.** AllReduce constant C(P) measured directly on Perlmutter NCCL collectives versus rank count P, summing rank-local vectors of length L ≥ 131072 against an FP64-accumulated reference. The observed factor reaches the worst-case C(P) ≲ 2.4P and is datatype-independent to within 4% across BF16, FP16, and FP32. The dotted curve is the power-law fit $4.62\,P^{0.56}$ and the dashed curve is the prior $3 + \log_2(P/4)$ assum…

**Figure 3.** Relative loss gap to Recipe A on a long-horizon synthetic logistic regression stress run

**Figure 4.** Scaling of CA-SGD and SGD on NERSC Perlmutter A100 under various precision set…

**Figure 5.** Per-kernel measured roofline at P = 1, m = 8192, n_loc = 16,384, b = 32, H = 20. Theoretical roofs (FP32 scalar 19.5, TF32 156, BF16 312 TFLOP/s) against 2039 GB/s HBM2e (knees at 9.56, 76.5, 153 FLOP/B). … BF16 roof), rising to 285.9 TFLOP/s (92% of peak) at s = 128. The 4× jump from TF32 to BF16 at modest s is the empirical reason Recipe C uses BF16-input tensor-core GEMM. Recipe F uses FP16 inputs with an …
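The roofline knees quoted in the Figure 5 caption follow from dividing each compute roof by the memory bandwidth. The sketch below redoes that arithmetic using only the numbers in the caption; the function and constant names are illustrative.

```python
HBM_BANDWIDTH_GBS = 2039.0                                   # GB/s, from the caption
PEAK_TFLOPS = {"FP32 scalar": 19.5, "TF32": 156.0, "BF16": 312.0}

def ridge_point(peak_tflops, bandwidth_gbs):
    """Arithmetic intensity (FLOP/byte) where the compute and memory roofs meet."""
    return peak_tflops * 1e12 / (bandwidth_gbs * 1e9)

def attainable_tflops(intensity, peak_tflops, bandwidth_gbs):
    """Roofline estimate: the lower of the compute roof and the memory roof."""
    return min(peak_tflops, intensity * bandwidth_gbs * 1e9 / 1e12)

for name, peak in PEAK_TFLOPS.items():
    print(f"{name}: knee at {ridge_point(peak, HBM_BANDWIDTH_GBS):.2f} FLOP/B")

# Fraction of the BF16 roof reached by the 285.9 TFLOP/s measurement.
print(f"{285.9 / PEAK_TFLOPS['BF16']:.0%} of BF16 peak")
```

A kernel whose arithmetic intensity sits below the knee is bounded by memory bandwidth, which is why the larger Gram GEMM at higher s can approach the tensor-core roof.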

Original abstract

Distributed stochastic gradient descent (SGD) is limited by communication rather than computation, since each iteration requires an AllReduce across processes. Communication-avoiding SGD (CA-SGD) amortizes communication over $s$ iterations by replacing $s$ consecutive AllReduces with a single AllReduce of an $sb\times sb$ Gram matrix, trading more computation and bandwidth for fewer synchronization points. Modern GPUs with matrix hardware and reduced-precision formats offset this by accelerating the Gram GEMM and shrinking BF16 traffic. We study mixed-precision CA-SGD for generalized linear models on NVIDIA GPUs. Our finite-precision analysis decomposes the local rounding error of one CA-SGD outer iteration into nine independent precision choices, depending on the hardware only through its low-precision unit roundoffs, so the resulting recipes transfer in principle across GPU generations. The recipe stores the input matrix and margin vector in low precision, computes the Gram matrix from low-precision inputs with high-precision accumulation, communicates it in high precision, and performs the inner recurrence and weight updates in high precision. On NERSC Perlmutter A100 GPUs, mixed-precision CA-SGD matches FP32 SGD loss within $0.5\%$ on logistic, linear, and Poisson problems and reaches $5.1$--$6.8\times$ speedup over FP32 SGD on epsilon, SUSY, HIGGS, synth, and Poisson-synth. Our software is available at https://doi.org/10.5281/zenodo.20448273
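The s-step idea in the abstract can be made concrete with a toy example. The sketch below uses least-squares regression as the simplest GLM and checks that one outer iteration built from a Gram matrix and a margin vector reproduces s plain SGD steps. It is a reconstruction under stated assumptions, not the authors' code: it runs on one process in FP64, ignores precision entirely, and the function names, sizes, and data are hypothetical. In a distributed run the Gram matrix `G` would be the quantity summed by the single AllReduce.

```python
import random

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def sgd(batches, targets, w, eta):
    """Plain mini-batch SGD on least squares: one weight update per batch."""
    for X, y in zip(batches, targets):
        b = len(X)
        u = [m - t for m, t in zip(matvec(X, w), y)]      # residual X w - y
        w = [wk - eta / b * sum(X[r][k] * u[r] for r in range(b))
             for k, wk in enumerate(w)]
    return w

def ca_sgd_outer(batches, targets, w, eta):
    """One s-step outer iteration that yields the same weights as sgd()."""
    b = len(batches[0])
    Y = [row for X in batches for row in X]               # stacked (s*b) x n
    y = [t for ys in targets for t in ys]
    G = [[sum(p * q for p, q in zip(ri, rj)) for rj in Y] for ri in Y]  # Gram
    z = matvec(Y, w)                                      # margins at the start
    u = []                                                # residuals so far
    for j in range(len(batches)):
        done = j * b                                      # rows of earlier batches
        # Inner recurrence: rebuild batch j's margins from G and earlier residuals.
        u_j = [z[r] - eta / b * sum(G[r][c] * u[c] for c in range(done)) - y[r]
               for r in range(done, done + b)]
        u.extend(u_j)
    # Single deferred weight update covering all s batches.
    return [wk - eta / b * sum(Y[r][k] * u[r] for r in range(len(Y)))
            for k, wk in enumerate(w)]

random.seed(1)
s, b, n, eta = 3, 2, 4, 0.1                               # hypothetical sizes
batches = [[[random.uniform(-1, 1) for _ in range(n)] for _ in range(b)]
           for _ in range(s)]
targets = [[random.uniform(-1, 1) for _ in range(b)] for _ in range(s)]
w0 = [0.5] * n

w_sgd = sgd(batches, targets, w0, eta)
w_ca = ca_sgd_outer(batches, targets, w0, eta)
max_diff = max(abs(p - q) for p, q in zip(w_sgd, w_ca))
print(f"max |w_sgd - w_ca| = {max_diff:.2e}")
```

For logistic or Poisson regression the residual line would apply the model's mean function to the margin before subtracting the target; the Gram-based recurrence for the margins keeps the same shape.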

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a concrete mixed-precision recipe for CA-SGD on GLMs that delivers measured 5-7x speedups on A100s with loss within 0.5% of FP32, supported by an explicit nine-term error decomposition.

This paper shows a mixed-precision CA-SGD that runs 5-7 times faster than FP32 on A100s while staying close in loss, using a nine-part error breakdown to choose precisions.

The main contribution is the decomposition of local rounding error into nine independent terms that depend on hardware only through low-precision unit roundoffs. From that they derive a specific recipe: store the input matrix and margin vector in low precision, form the Gram matrix with high-precision accumulation, communicate in high precision, and keep the inner recurrence and updates in high precision. The experiments on logistic, linear, and Poisson problems across several datasets back the claim that loss stays within 0.5% of full-precision SGD.

The work is useful because it ships real hardware numbers on Perlmutter and releases the code. That makes the speedup claims checkable rather than theoretical.

The soft spot is the transferability argument. The analysis treats the nine error sources as independent and determined solely by unit roundoffs, but GPU tensor-core FMAs, shared-memory access, and reduction patterns can introduce dimension-dependent or warp-level correlations that the model does not capture. If those effects matter, the recipes may need retuning per architecture even if the abstract claims otherwise.

The paper is for researchers who run distributed GLM training on GPUs and want to reduce AllReduce frequency without losing accuracy. It deserves peer review because the speedup data and open implementation give referees something concrete to evaluate, even if the finite-precision details require closer inspection.

Referee Report

2 major / 2 minor

Summary. The paper introduces mixed-precision communication-avoiding SGD (CA-SGD) for generalized linear models on NVIDIA GPUs. It presents a finite-precision analysis that decomposes the local rounding error of one CA-SGD outer iteration into nine independent precision choices depending on hardware only through low-precision unit roundoffs. The resulting recipe stores the input matrix and margin vector in low precision, computes the Gram matrix from low-precision inputs with high-precision accumulation, communicates the Gram matrix in high precision, and performs the inner recurrence and weight updates in high precision. On NERSC Perlmutter A100 GPUs, the method matches FP32 SGD loss within 0.5% on logistic, linear, and Poisson problems while achieving 5.1--6.8× speedup over FP32 SGD on several datasets; open-source code is provided.

Significance. If the error decomposition holds and the reported speedups and accuracy are reproducible, the work provides a practical route to reduce communication overhead in distributed GLM training while preserving accuracy through mixed precision on tensor-core GPUs. The explicit dependence on unit roundoffs (rather than hardware-specific details) supports potential transferability across GPU generations, and the open code at the cited DOI is a clear strength for verification and extension.

major comments (2)

[Abstract and §3, finite-precision analysis] The claim that the local rounding error decomposes into nine independent precision choices depending on hardware solely through low-precision unit roundoffs underpins both the recipe and the transferability assertion. However, this decomposition does not address potential hardware-specific correlations from tensor-core FMA behavior, shared-memory bank conflicts, or warp-level reductions, which could introduce cross terms that vary with matrix dimensions and block size s; without explicit validation of the independence assumption against measured GPU roundoff for the tested s values, the 0.5% loss-matching guarantee may not transfer.
[§5, experiments] The reported 5.1-6.8× speedup and 0.5% loss match are central to the practical claim, yet the manuscript provides no breakdown of how the nine precision choices were mapped to the actual A100 kernels (e.g., which operations used TF32 vs. BF16 accumulation) or ablation showing that altering any one choice violates the error bound; this makes it impossible to confirm that the observed results follow from the analysis rather than from unmodeled hardware effects.

minor comments (2)

[Abstract] The abstract states the datasets (epsilon, SUSY, HIGGS, synth, Poisson-synth) but does not specify problem dimensions or the range of s values used; adding these would clarify the regime in which the speedups hold.
Notation for the nine precision choices is introduced without an explicit table mapping each choice to the corresponding operation (input storage, Gram accumulation, etc.); a compact table would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight important aspects of the finite-precision analysis and its connection to experiments. We address each point below and propose targeted revisions to clarify assumptions and implementation details.

Referee: [Abstract and §3] the claim that the local rounding error decomposes into nine independent precision choices depending on hardware solely through low-precision unit roundoffs underpins both the recipe and the transferability assertion. However, this decomposition does not address potential hardware-specific correlations from tensor-core FMA behavior, shared-memory bank conflicts, or warp-level reductions, which could introduce cross terms that vary with matrix dimensions and block size s; without explicit validation of the independence assumption against measured GPU roundoff for the tested s values, the 0.5% loss-matching guarantee may not transfer.

Authors: We appreciate the referee's observation on the independence assumption in the error model. The decomposition follows the standard floating-point analysis framework (bounding each operation's error by its unit roundoff independently), which is conventional in mixed-precision literature. The nine choices are identified by tracing distinct operations in one CA-SGD outer iteration. We acknowledge that tensor-core and memory-hierarchy effects may introduce correlations outside this model. In the revision we will add an explicit paragraph in §3 stating the assumption, its relation to hardware-specific behavior, and the resulting scope of the transferability claim, supported by references to standard error analysis texts. revision: partial
Referee: [§5] the reported 5.1--6.8× speedup and 0.5% loss match are central to the practical claim, yet the manuscript provides no breakdown of how the nine precision choices were mapped to the actual A100 kernels (e.g., which operations used TF32 vs. BF16 accumulation) or ablation showing that altering any one choice violates the error bound; this makes it impossible to confirm that the observed results follow from the analysis rather than from unmodeled hardware effects.

Authors: We agree that a clearer mapping between the nine choices and A100 kernels would strengthen the link between analysis and results. The recipe in §3 directly dictates the mapping: BF16 storage for inputs, FP32 accumulation in tensor-core GEMM for the Gram matrix, FP32 for the communicated Gram matrix, and FP32 for the inner recurrence and updates. In revision we will insert a table in §5 that explicitly lists each of the nine operations, the chosen format/accumulation mode, and the corresponding CUDA/Tensor Core primitive used on A100. A full per-choice ablation would require substantial new runs; however, the error bounds derived in §3 already indicate the necessity of each choice to keep the local error below the observed 0.5% threshold. We will add a short paragraph referencing the bounds to explain why deviations would be expected to increase error. revision: partial

Circularity Check

0 steps flagged

No circularity: error analysis uses standard roundoff model; results are empirical

The paper presents a finite-precision decomposition of CA-SGD rounding error into nine independent choices governed solely by low-precision unit roundoffs, then reports direct experimental loss matching (within 0.5%) and speedups on A100 GPUs. No equation reduces a claimed prediction to a fitted parameter by construction, no load-bearing premise rests on a self-citation chain, and no ansatz is imported via prior work by the same authors. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; no free parameters, invented entities, or ad-hoc axioms are stated. The nine precision choices are design decisions analyzed via standard floating-point error models rather than fitted quantities.

axioms (1)

standard math: Floating-point rounding errors in matrix operations can be decomposed into independent contributions from each precision choice and depend only on unit roundoffs.
Invoked to justify transferability of the recipe across GPU generations.
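For orientation, the textbook form of this axiom and a first consequence for a single inner product are sketched below. This is standard rounding-error analysis applied to the "store low, accumulate high" pattern, not the paper's nine-term decomposition, which the source does not reproduce. The symbols $u_\ell$ and $u_h$ (low and high unit roundoffs) and the hats on stored quantities are notation introduced here.

```latex
% Standard model of floating-point arithmetic with unit roundoff u:
\[
  \mathrm{fl}(a \circ b) = (a \circ b)(1+\delta), \qquad |\delta| \le u,
  \qquad \circ \in \{+,-,\times,/\}.
\]
% Inputs stored in low precision:
\[
  \hat{x}_i = x_i(1+\delta_i), \quad \hat{y}_i = y_i(1+\epsilon_i),
  \qquad |\delta_i|,\,|\epsilon_i| \le u_\ell .
\]
% Length-n inner product of the stored inputs, accumulated in high precision:
\[
  \bigl|\mathrm{fl}_h(\hat{x}^{T}\hat{y}) - x^{T}y\bigr|
  \;\le\; \Bigl[\,2u_\ell + u_\ell^{2} + \gamma_n^{(h)}(1+u_\ell)^{2}\Bigr]\,|x|^{T}|y|,
  \qquad \gamma_n^{(h)} = \frac{n\,u_h}{1 - n\,u_h}, \quad n\,u_h < 1 .
\]
```

The storage error contributes a term of order $u_\ell$ that does not grow with $n$, while the only term that grows with the vector length is scaled by $u_h$. Whether cross terms between such contributions stay negligible on real tensor-core hardware is exactly the independence question raised in the referee report.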

