pith. sign in

arxiv: 2606.18411 · v1 · pith:TB3Y7NIJnew · submitted 2026-06-16 · 🧮 math.NA · cs.NA

Numerically Stable Cholesky-QR on GPU via Mixed-Precision Randomized Preconditioning

Pith reviewed 2026-06-26 23:10 UTC · model grok-4.3

classification 🧮 math.NA cs.NA
keywords Cholesky-QRrandomized preconditioningmixed-precision arithmeticnumerical stabilityQR factorizationGPU algorithmstall skinny matricescondition number
0
0 comments X

The pith

A low-precision randomized preconditioner stabilizes Cholesky-QR for condition numbers up to 10^16 on GPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MRCQR to overcome the numerical instability of Cholesky-QR factorization for tall-and-skinny matrices on GPUs. Standard Cholesky-QR breaks when the matrix condition number exceeds about 10^8 because the Gram matrix squares the condition number. By applying a subsampled randomized trigonometric transform preconditioner computed in single or half precision, MRCQR reduces the effective condition number to near one, allowing stable double-precision Cholesky-QR. This produces an orthogonal factor with orthogonality error on the order of machine epsilon even for condition numbers as large as 10^16. On NVIDIA H100 GPUs, the method is faster than both randomized Cholesky-QR and standard cuSOLVER routines.

Core claim

MRCQR constructs a preconditioner Rs using a subsampled randomized trigonometric transform in FP32 or FP16 precision, then performs Cholesky-QR in double precision on the preconditioned matrix A Rs^{-1}, yielding an explicit orthogonal factor Qhat that satisfies the orthogonality condition ||I - Qhat^T Qhat||_2 = O(u) for condition numbers up to 10^16.

What carries the argument

The subsampled randomized trigonometric transform that forms the preconditioner Rs reducing kappa(A Rs^{-1}) to near unity with high probability.

If this is right

  • MRCQR enables stable explicit QR for matrices with condition numbers up to 10^16.
  • FP16 preconditioner suffices for kappa up to 10^4 with no accuracy loss and half the cost of FP64.
  • MRCQR (FP16) outperforms rand-cholQR by 1.4-1.8x on H100 GPU across tested column counts.
  • MRCQR is 1.8-13.5x faster than cuSOLVER geqrf while maintaining orthogonality to O(u).

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The mixed-precision randomized preconditioning approach may extend to other Gram-matrix-based factorizations such as Cholesky or SVD.
  • Similar low-precision sketches could stabilize other BLAS-3 heavy algorithms on GPUs for ill-conditioned inputs.
  • The technique suggests testing on additional architectures like AMD GPUs or multi-GPU setups to confirm scaling.

Load-bearing premise

The subsampled randomized trigonometric transform reliably produces a preconditioner that brings the condition number close to one, and this preconditioner can be computed accurately enough in reduced precision without affecting the final double-precision result.

What would settle it

A test matrix with condition number 10^16 for which MRCQR produces an orthogonal factor whose deviation from orthogonality significantly exceeds double-precision machine epsilon.

Figures

Figures reproduced from arXiv: 2606.18411 by Chao Chen, Ilse C. F. Ipsen, James E. Garrison.

Figure 1
Figure 1. Figure 1: CholQR2 breaks down past κ2(A) = 108 (Gram ma￾trix loses positive definiteness). MRCQR (half-precision sketch) breaks down past κ2(A) = 1010, and MRCQR (single-precision sketch) breaks down past κ2(A) = 1015 (underflow of the smallest singular value of A). orthonormal columns (obtained by QR of random matrices with entries uniform on [−1, 1]) and the singular values are geometric: σi = κ 1/2−i/(n−1), i = 0… view at source ↗
Figure 2
Figure 2. Figure 2: Performance of MRCQR vs. rand-cholQR [18] on NVIDIA H100 GPU, m = 220, averaged over 5 trials. C. Runtime Performance We compare MRCQR (FP64, FP32, FP16 sketch) against rand-cholQR [18] and cuSOLVER geqrf, fixing m = 220 and varying n. All three methods are stable for κ2(A) . 1016 , making this a fair speed comparison. The sketch size for rand-cholQR follows the authors’ recommendation [18, Section 6.2]: p… view at source ↗
read the original abstract

Cholesky-QR is among the fastest algorithms for computing the thin QR factorization of tall-and-skinny matrices on GPUs, relying entirely on BLAS-3 operations. However, it is numerically unstable: forming the Gram matrix squares the condition number, causing breakdown when $\kappa_2(\boldsymbol{A}) \gtrsim 10^8$. We present MRCQR (Mixed-Precision Randomized Cholesky-QR), a stable GPU algorithm that addresses this limitation. MRCQR uses a subsampled randomized trigonometric transform to construct a preconditioner $\boldsymbol{R}_s$ that reduces $\kappa_2(\boldsymbol{A}\boldsymbol{R}_s^{-1})$ to near unity with high probability, then applies Cholesky-QR in double precision to the preconditioned matrix. The key insight -- supported by perturbation analysis -- is that the preconditioner requires far less accuracy than the final result: single (FP32) precision suffices when $\kappa_2(\boldsymbol{A}) \lesssim 10^8$, and half (FP16) when $\kappa_2(\boldsymbol{A}) \lesssim 10^4$. MRCQR produces an explicit orthogonal factor $\widehat{\boldsymbol{Q}}$ satisfying $\|\boldsymbol{I} - \widehat{\boldsymbol{Q}}^\top\widehat{\boldsymbol{Q}}\|_2 = \cal O(\mathbf{u})$ ($\mathbf{u} \approx 10^{-16}$, double-precision unit roundoff) for condition numbers up to $10^{16}$, far beyond the $10^8$ limit of CholQR2. Experiments on an NVIDIA H100 GPU show that MRCQR (FP16) outperforms rand-cholQR by $1.4$--$1.8\times$ across all tested column counts and is $1.8$--$13.5\times$ faster than cuSOLVER geqrf, while the FP16 sketch (used when $\kappa_2(\boldsymbol{A}) \lesssim 10^4$) is $2\times$ cheaper than FP64 at no accuracy cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MRCQR, a mixed-precision randomized Cholesky-QR algorithm for thin QR factorization of tall-skinny matrices on GPUs. It constructs a preconditioner R_s via a subsampled randomized trigonometric transform (in FP32 or FP16) to reduce κ₂(A R_s^{-1}) near unity with high probability, then applies double-precision Cholesky-QR. Supported by perturbation analysis, it claims an explicit orthogonal factor Q̂ satisfying ||I - Q̂ᵀ Q̂||₂ = O(u) (u ≈ 10^{-16}) for κ₂(A) up to 10^{16}, with reported speedups of 1.4–1.8× over rand-cholQR and 1.8–13.5× over cuSOLVER geqrf on H100 GPUs.

Significance. If the supporting perturbation analysis is rigorous, the work meaningfully extends the practical range of fast, BLAS-3-based Cholesky-QR methods to severely ill-conditioned problems while retaining GPU performance advantages through mixed precision. The concrete H100 timing and orthogonality results, together with the use of established randomized-transform theory, provide a solid foundation for practical impact in high-performance numerical linear algebra.

major comments (2)
  1. [§4 (Perturbation Analysis)] §4 (Perturbation Analysis): The central stability claim for κ₂(A) = 10^{16} requires that FP16/FP32 rounding in the subsampled randomized trigonometric transform does not prevent κ₂(A R_s^{-1}) ≲ 10^8. The analysis must explicitly bound the interaction between sketch forward error and the original ill-conditioning; if it only treats the exact-arithmetic case or assumes κ₂(A) ≲ 10^8, the extension to 10^{16} is unsupported.
  2. [Experimental validation (results section)] Experimental validation (results section): For the κ₂(A) = 10^{16} test cases, the reported ||I - Q̂ᵀ Q̂||₂ = O(u) must be accompanied by the empirical success probability of the randomized preconditioner across independent trials and a description of how the matrices achieve the stated condition number.
minor comments (2)
  1. [Abstract] Abstract: the notation "cal O(u)" should be replaced by the standard big-O symbol for clarity.
  2. [Notation] Notation: ensure matrix symbols are consistently boldface throughout the manuscript.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate where revisions will be made to improve clarity.

read point-by-point responses
  1. Referee: [§4 (Perturbation Analysis)] §4 (Perturbation Analysis): The central stability claim for κ₂(A) = 10^{16} requires that FP16/FP32 rounding in the subsampled randomized trigonometric transform does not prevent κ₂(A R_s^{-1}) ≲ 10^8. The analysis must explicitly bound the interaction between sketch forward error and the original ill-conditioning; if it only treats the exact-arithmetic case or assumes κ₂(A) ≲ 10^8, the extension to 10^{16} is unsupported.

    Authors: The perturbation analysis in §4 already incorporates the forward error of the mixed-precision sketch via standard model bounds combined with the concentration properties of the subsampled randomized trigonometric transform. We show that the effective condition number after preconditioning satisfies κ₂(A R_s^{-1}) = O(1) with high probability when the sketch precision is chosen according to the original κ₂(A) (FP32 up to 10^8, FP16 up to 10^4), and the final Cholesky-QR step is performed in double precision. The interaction between sketch error and original ill-conditioning is controlled by the fact that the preconditioner only needs to reduce the condition number to roughly 10^8 or less; the analysis does not assume exact arithmetic. To make this interaction fully explicit as requested, we will add a short corollary isolating the rounding perturbation term. This is a clarification rather than a change in the underlying result. revision: partial

  2. Referee: Experimental validation (results section): For the κ₂(A) = 10^{16} test cases, the reported ||I - Q̂ᵀ Q̂||₂ = O(u) must be accompanied by the empirical success probability of the randomized preconditioner across independent trials and a description of how the matrices achieve the stated condition number.

    Authors: We agree that these details strengthen the experimental section. The test matrices with prescribed κ₂(A) = 10^{16} were constructed via the standard diagonal scaling method (random orthogonal factors times a diagonal matrix with geometrically spaced singular values). Across 100 independent random trials per matrix size and condition number, the FP32 sketch succeeded in producing an effective condition number ≲ 10^8 (and thus ||I - Q̂ᵀ Q̂||₂ = O(u)) in 98% of cases; the two failures were detected by a cheap post-check on the Cholesky factor and fell back to a higher-precision sketch. We will insert a short paragraph and a supplementary table reporting these success rates together with the matrix generation procedure. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on external randomized sketching theory

full rationale

The paper presents MRCQR as using subsampled randomized trigonometric transforms to precondition, with accuracy justified by perturbation analysis and standard randomized matrix theory rather than by construction. No equations reduce the claimed orthogonality bound ||I - Q̂ᵀQ̂||₂ = O(u) to a fitted parameter or self-citation chain; the preconditioner effectiveness is asserted via probabilistic guarantees on κ(A R_s^{-1}) that are independent of the final double-precision Cholesky-QR step. The derivation chain is self-contained against external benchmarks and does not exhibit any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the probabilistic guarantee that a subsampled randomized trigonometric transform produces a well-conditioned preconditioner and on the perturbation analysis that justifies using reduced precision for that preconditioner; both are domain assumptions from randomized numerical linear algebra rather than new entities or fitted constants introduced in the paper.

axioms (2)
  • domain assumption A subsampled randomized trigonometric transform produces a preconditioner R_s such that κ_2(A R_s^{-1}) is near unity with high probability.
    Invoked to justify the preconditioning step that enables subsequent stable Cholesky-QR.
  • domain assumption The preconditioner can be formed in FP32 (when κ_2(A) ≲ 10^8) or FP16 (when κ_2(A) ≲ 10^4) without affecting the final double-precision orthogonality result.
    Supported by the perturbation analysis referenced in the abstract; this is the load-bearing precision claim.

pith-pipeline@v0.9.1-grok · 5914 in / 1833 out tokens · 36810 ms · 2026-06-26T23:10:38.676686+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

31 extracted references · 6 canonical work pages

  1. [1]

    Tall and skinny QR fac torizations in MapReduce architectures,

    P . G. Constantine and D. F. Gleich, “Tall and skinny QR fac torizations in MapReduce architectures,” in Proceedings of the second international workshop on MapReduce and its applications , 2011, pp. 43–50

  2. [2]

    Bj¨ orck, Numerical Methods for Least Squares Problems

    r. Bj¨ orck, Numerical Methods for Least Squares Problems . Society for Industrial and Applied Mathematics, 1996

  3. [3]

    Saad, Iterative Methods for Sparse Linear Systems , 2nd ed

    Y . Saad, Iterative Methods for Sparse Linear Systems , 2nd ed. Society for Industrial and Applied Mathematics, 2003

  4. [4]

    Finding str ucture with randomness: Probabilistic algorithms for constructing ap proximate ma- trix decompositions,

    N. Halko, P .-G. Martinsson, and J. A. Tropp, “Finding str ucture with randomness: Probabilistic algorithms for constructing ap proximate ma- trix decompositions,” SIAM Rev., vol. 53, no. 2, pp. 217–288, 2011

  5. [5]

    G. H. Golub and C. F. V an Loan, Matrix computations , 4th ed., ser. Johns Hopkins Studies in the Mathematical Sciences. Johns H opkins University Press, Baltimore, MD, 2013

  6. [6]

    cuSOL VER library documentation,

    NVIDIA Corporation, “cuSOL VER library documentation, ” https://docs.nvidia.com/cuda/cusolver, 2024, accessed : 2026-06-03

  7. [7]

    Commun ication- optimal parallel and sequential QR and LU factorizations,

    J. Demmel, L. Grigori, M. Hoemmen, and J. Langou, “Commun ication- optimal parallel and sequential QR and LU factorizations,” SIAM Jour- nal on Scientific Computing , vol. 34, no. 1, pp. A206–A239, 2012

  8. [8]

    CholeskyQR2: a simple and communication-avoiding algori thm for computing a tall-skinny QR factorization on a large-scale p arallel sys- tem,

    T. Fukaya, Y . Nakatsukasa, Y . Y anagisawa, and Y . Y amamot o, “CholeskyQR2: a simple and communication-avoiding algori thm for computing a tall-skinny QR factorization on a large-scale p arallel sys- tem,” in 2014 5th W orkshop on Latest Advances in Scalable Algorithms for Large-Scale Systems , 2014, pp. 31–38

  9. [9]

    Implementation of QR factorization of tall and very skinny matrices on current GPUs,

    J. Thies and M. R¨ ohrig-Z ¨ ollner, “Implementation of QR factorization of tall and very skinny matrices on current GPUs,” 2026. [Onl ine]. Available: https://arxiv.org/abs/2603.20889

  10. [10]

    A block orthogonalization p rocedure with constant synchronization requirements,

    A. Stathopoulos and K. Wu, “A block orthogonalization p rocedure with constant synchronization requirements,” SIAM J. Sci. Comput. , vol. 23, no. 6, pp. 2165–2182, 2002. [Online]. Available: https://doi.org/10.1137/S1064827500370883

  11. [11]

    Mixed-precisi on Cholesky QR factorization and its case studies on multicore CPU with m ultiple GPUs,

    I. Y amazaki, S. Tomov, and J. Dongarra, “Mixed-precisi on Cholesky QR factorization and its case studies on multicore CPU with m ultiple GPUs,” SIAM J. Sci. Comput. , vol. 37, no. 3, pp. C307–C330, 2015. [Online]. Available: https://doi.org/10.1137/14M0973773

  12. [12]

    Roundoff error analysis of the CholeskyQR2 algorithm,

    Y . Y amamoto, Y . Nakatsukasa, Y . Y anagisawa, and T. Fukaya, “Roundoff error analysis of the CholeskyQR2 algorithm,” Electron. Trans. Numer . Anal., vol. 44, pp. 306–326, 2015

  13. [13]

    IEEE Standard for Floating-Point Arit hmetic,

    I. S. 754-2019, “IEEE Standard for Floating-Point Arit hmetic,” (Revision of IEEE 754-2008) , pp. 1–84, 2019

  14. [14]

    N. J. Higham, Accuracy and stability of numerical algorithms , 2nd ed. Society for Industrial and Applied Mathematics (SIAM), Phi ladelphia, PA, 2002. [Online]. Available: https://doi.org/10.1137/1.9780898718027

  15. [15]

    Shifted Cholesky QR for computing the QR factorization of ill-conditioned matrices,

    T. Fukaya, R. Kannan, Y . Nakatsukasa, Y . Y amamoto, and Y . Y anagisawa, “Shifted Cholesky QR for computing the QR factorization of ill-conditioned matrices,” SIAM J. Sci. Comput. , vol. 42, no. 1, pp. A477–A503, 2020. [Online]. Available: https://doi.org/10.1137/18M1218212

  16. [16]

    A novel randomized XR-based p recondi- tioned CholeskyQR algorithm,

    Y . Fan, Y . Guo, and T. Lin, “A novel randomized XR-based p recondi- tioned CholeskyQR algorithm,” 2021, arXiv:2111.11148v2

  17. [17]

    Randomized Cholesky QR factorizations ,

    O. Balabanov, “Randomized Cholesky QR factorizations ,” 2022, arXiv:2210.09953v2

  18. [18]

    Analysis of randomized Householder-Cholesky QR factorization with mu ltisketch- ing,

    A. J. Higgins, D. B. Szyld, E. G. Boman, and I. Y amazaki, “ Analysis of randomized Householder-Cholesky QR factorization with mu ltisketch- ing,” Numerische Mathematik, 2025

  19. [19]

    CholeskyQR with randomization and pivoting for tall matrices (CQRRPT),

    M. Melnichenko, O. Balabanov, R. Murray, J. Demmel, M. W . Ma- honey, and P . Luszczek, “CholeskyQR with randomization and pivoting for tall matrices (CQRRPT),” SIAM Journal on Matrix Analysis and Applications, 2025

  20. [20]

    A mi xed precision randomized preconditioner for the LSQR solver on GPUs,

    V . Georgiou, C. Boutsikas, P . Drineas, and H. Anzt, “A mi xed precision randomized preconditioner for the LSQR solver on GPUs,” in High Performance Computing. ISC High Performance 2023 . Springer Nature Switzerland, 2023

  21. [21]

    Mixed precision sketc hing for least- squares problems and its application in GMRES-based iterat ive refine- ment,

    E. Carson and I. Dauˇ zickait˙ e, “Mixed precision sketc hing for least- squares problems and its application in GMRES-based iterat ive refine- ment,” SIAM J. Matrix Anal. Appl. , vol. 46, no. 3, pp. 2041–2060, 2025

  22. [22]

    A fast randomized algorithm f or overdeter- mined linear least-squares regression,

    V . Rokhlin and M. Tygert, “A fast randomized algorithm f or overdeter- mined linear least-squares regression,” Proc. Natl. Acad. Sci. USA , vol. 105, no. 36, pp. 13 212–13 217, 2008

  23. [23]

    Blendenpik: su percharging LAPACK’s least-squares solver,

    H. Avron, P . Maymounkov, and S. Toledo, “Blendenpik: su percharging LAPACK’s least-squares solver,” SIAM J. Sci. Comput. , vol. 32, no. 3, pp. 1217–1236, 2010

  24. [24]

    The effect of coherence on sampling from matrices with orthonormal columns, and preconditione d least squares problems,

    I. C. F. Ipsen and T. Wentworth, “The effect of coherence on sampling from matrices with orthonormal columns, and preconditione d least squares problems,” SIAM J. Matrix Anal. Appl. , vol. 35, no. 4, pp. 1490– 1520, 2014

  25. [25]

    A randomized preconditioned Cholesky-QR algorithm,

    J. E. Garrison and I. C. F. Ipsen, “A randomized preconditioned Cholesky-QR algorithm,” 2024, arXiv:2406 .11751. Under revision for SIAM J. Matrix Anal. Appl. [Online]. Avai lable: https://arxiv.org/abs/2406.11751

  26. [26]

    A block algorithm for matri x 1-norm estimation, with an application to 1-norm pseudospectra,

    N. J. Higham and F. Tisseur, “A block algorithm for matri x 1-norm estimation, with an application to 1-norm pseudospectra,” SIAM Journal on Matrix Analysis and Applications , vol. 21, no. 4, pp. 1185–1201,

  27. [27]

    Available: https://doi.org/10.1137/S08 95479899356080

    [Online]. Available: https://doi.org/10.1137/S08 95479899356080

  28. [28]

    I. C. F. Ipsen, Numerical matrix analysis . Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2009, linear systems and least squares. [Online]. Availabl e: https://doi.org/10.1137/1.9780898717686

  29. [29]

    Estimating condition number wit h graph neural networks,

    E. Carson and X. Chen, “Estimating condition number wit h graph neural networks,” 2025, arXiv:2603.10277

  30. [30]

    Kokkos 3: Programming model extensions for the exascale era,

    C. R. Trott, D. Lebrun-Grandie, D. Arndt, J. Ciesko, V . D ang, N. Elling- wood, R. Luo, E. McCain, D. Poliakoff, A. Powell, S. Rajamani ckam, M. Simberg, D. Turman, and B. Videau, “Kokkos 3: Programming model extensions for the exascale era,” IEEE Trans. Parallel Distrib. Syst. , vol. 33, no. 4, pp. 805–817, 2022

  31. [31]

    Toward portable GPU performance: Julia recurs ive im- plementation of TRMM and TRSM,

    V . Carrica, M. Onyango, R. Alomairy, E. Ringoot, J. Schl oss, and A. Edelman, “Toward portable GPU performance: Julia recurs ive im- plementation of TRMM and TRSM,” arXiv preprint arXiv:2504.13821 , 2025