Pith · machine review for the scientific record

arxiv: 2605.13209 · v1 · submitted 2026-05-13 · 💻 cs.DC · cs.PF

Recognition: no theorem link

Comparing the Performance of Heterogeneous Conjugate Gradient and Cholesky Solvers on Various Hardware Using SYCL

Alexander Strack, Dirk Pflüger, Tim Thüring

Pith reviewed 2026-05-14 01:59 UTC · model grok-4.3

classification 💻 cs.DC cs.PF
keywords heterogeneous computing · SYCL · conjugate gradient · Cholesky decomposition · GPU · CPU · performance comparison · linear solvers

The pith

Heterogeneous CPU-GPU implementations of CG and Cholesky solvers run up to 32 percent faster than GPU-only versions for large matrices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that splitting the work of solving large symmetric positive-definite linear systems between CPU and GPU in the same machine can deliver better performance than running everything on the GPU alone. The authors build portable implementations of the conjugate gradient method and Cholesky decomposition using SYCL that keep both processors busy at once. On big matrices this yields speedups of up to 32 percent for CG and 29 percent for Cholesky relative to tuned GPU-only code, with the Cholesky version also running at least 12 percent faster across NVIDIA, AMD, and Intel GPUs. The approach matters for applications such as Gaussian-process system identification that repeatedly solve very large systems and currently leave the CPU idle while the GPU works.
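
To make that concrete, here is a minimal editorial sketch (ours, not the authors' code) of the SYCL pattern the paper builds on: give each device its own queue and its own slice of the data, launch on both without an intervening wait so the kernels overlap, and synchronize only at the end. The split ratio below is a placeholder; choosing it per kernel and per machine is precisely the tuning problem the paper studies.

```cpp
// Editorial sketch, not the paper's code: one vector update split across a
// CPU queue and a GPU queue so the two devices compute concurrently.
// Assumes a node that exposes both a SYCL CPU device and a SYCL GPU device.
#include <sycl/sycl.hpp>

int main() {
  sycl::queue cpu_q{sycl::cpu_selector_v};
  sycl::queue gpu_q{sycl::gpu_selector_v};

  constexpr size_t n = 1 << 24;
  const size_t cpu_n = n / 4;        // placeholder split ratio; tuning it is the hard part
  const size_t gpu_n = n - cpu_n;

  // Shared USM on the CPU's context: directly usable by the CPU kernel and
  // host-accessible, so it also serves as the staging area for the GPU copy.
  float* x = sycl::malloc_shared<float>(n, cpu_q);
  float* y = sycl::malloc_shared<float>(n, cpu_q);
  for (size_t i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 0.0f; }

  float* dx = sycl::malloc_device<float>(gpu_n, gpu_q);
  float* dy = sycl::malloc_device<float>(gpu_n, gpu_q);

  // Stage the GPU's slice, then launch on both devices with no wait in
  // between; the two kernels overlap in time.
  auto cx = gpu_q.copy(x + cpu_n, dx, gpu_n);
  auto cy = gpu_q.copy(y + cpu_n, dy, gpu_n);
  auto e_gpu = gpu_q.parallel_for(sycl::range<1>{gpu_n}, {cx, cy},
      [=](sycl::id<1> i) { dy[i] += 2.0f * dx[i]; });
  auto e_cpu = cpu_q.parallel_for(sycl::range<1>{cpu_n},
      [=](sycl::id<1> i) { y[i] += 2.0f * x[i]; });

  // Synchronize both devices and gather the GPU slice back.
  e_cpu.wait();
  gpu_q.copy(dy, y + cpu_n, gpu_n, e_gpu).wait();

  sycl::free(dx, gpu_q);
  sycl::free(dy, gpu_q);
  sycl::free(x, cpu_q);
  sycl::free(y, cpu_q);
}
```

The end-of-iteration synchronization point is where the heterogeneous schedule pays its overhead; everything before it is the concurrency the paper is selling.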

Core claim

The heterogeneous implementations of the CG method and the Cholesky decomposition that leverage the CPU and GPU simultaneously using SYCL achieve up to 32 percent faster runtimes for the CG method and up to 29 percent faster for the Cholesky decomposition compared to the corresponding GPU-only implementations on large matrices. In addition, the heterogeneous Cholesky implementation achieves at least 12 percent faster runtimes across several systems with GPUs from NVIDIA, AMD, and Intel.

What carries the argument

Heterogeneous scheduling of CG iterations and Cholesky factorization steps across CPU and GPU using SYCL, with explicit management of data movement and synchronization.
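
For orientation, the standard unpreconditioned CG recurrence for Ax = b with symmetric positive-definite A (textbook material, not quoted from the paper); the matrix-vector product A p_k dominates each iteration and is the natural quantity to split by rows between the two devices:

```latex
% Standard unpreconditioned CG for A x = b with SPD A (textbook form).
r_0 = b - A x_0, \qquad p_0 = r_0, \qquad \text{and for } k = 0, 1, 2, \ldots:
\begin{aligned}
  \alpha_k &= \frac{r_k^{\top} r_k}{p_k^{\top} A p_k}, &
  x_{k+1}  &= x_k + \alpha_k p_k, &
  r_{k+1}  &= r_k - \alpha_k A p_k, \\
  \beta_k  &= \frac{r_{k+1}^{\top} r_{k+1}}{r_k^{\top} r_k}, &
  p_{k+1}  &= r_{k+1} + \beta_k p_k.
\end{aligned}
```

On the direct-solver side, Cholesky factors A = LL^T; in its blocked form each step factors a diagonal tile, triangular-solves the panel beneath it, and applies a symmetric rank-k update to the trailing submatrix, and it is these tile-level tasks that a heterogeneous scheduler can place on either device.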

If this is right

  • For large matrices the heterogeneous CG solver finishes up to 32 percent sooner than the GPU-only version.
  • For large matrices the heterogeneous Cholesky solver finishes up to 29 percent sooner than the GPU-only version.
  • The heterogeneous Cholesky solver delivers at least 12 percent faster runtimes on systems containing NVIDIA, AMD, or Intel GPUs.
  • GPU-only implementations leave the CPUs of modern HPC nodes idle during these linear solves.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same heterogeneous split could be applied to other direct and iterative solvers for symmetric positive-definite systems.
  • Better CPU utilization may reduce overall energy use in HPC runs that currently idle the CPU during GPU kernels.
  • SYCL-based heterogeneous scheduling offers a practical path to portable performance gains without vendor-specific rewrites.

Load-bearing premise

The overhead of data movement and synchronization between CPU and GPU remains low enough for the heterogeneous schedule to outperform a well-tuned GPU-only kernel on the tested matrix sizes and hardware configurations.
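
A back-of-envelope model of that premise (an editorial illustration, not the paper's analysis): let T_g be the GPU-only runtime, rho the CPU's throughput relative to the GPU on this kernel, f the fraction of work moved to the CPU, and T_c the extra transfer-and-synchronization cost of the split. Then, roughly,

```latex
% Editorial cost model: T_g = GPU-only time, \rho = CPU/GPU throughput ratio,
% f = work fraction on the CPU, T_c = transfer + synchronization overhead.
T_{\mathrm{het}} \approx \max\!\left( (1 - f)\, T_g,\; \frac{f}{\rho}\, T_g \right) + T_c,
\qquad
f^{\star} = \frac{\rho}{1 + \rho}
\;\Longrightarrow\;
T_{\mathrm{het}} \approx \frac{T_g}{1 + \rho} + T_c .
```

Under this toy model the split pays off only while T_c < T_g · rho/(1 + rho); a 32 percent gain at negligible coordination cost corresponds to the CPU absorbing roughly a third of the work, i.e. sustaining about half the GPU's throughput on these kernels.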

What would settle it

Runtime measurements on matrices large enough or interconnects slow enough that data-transfer costs exceed the benefit of CPU participation, producing slower heterogeneous times than the GPU-only baseline.

Figures

Figure captions extracted from arXiv:2605.13209 by Alexander Strack, Dirk Pflüger, and Tim Thüring (several truncated in extraction).

Figure 1: Different workload distributions for the heteroge…
Figure 2: Heterogeneous and homogeneous runtime com…
Figure 3: Comparison of the homogeneous and heteroge…
Figure 4: Comparison of the homogeneous and heteroge…
Figure 5: Different workload distributions for the heteroge…
Figure 6: Heterogeneous and homogeneous runtime com…
Figure 8: Comparison of the homogeneous and hetero…
Figure 9: Comparison of the homogeneous and heterogeneous CG algorithm and Cholesky algorithm on all hardware found in …
Original abstract

Many important real-world applications, such as System Identification with Gaussian Processes, involve solving linear systems with symmetric positive-definite matrices. The iterative CG method and direct solvers based on the Cholesky decomposition are two popular methods that can be applied in this case. Since often very large systems have to be solved when dealing with such real-world scenarios, GPUs are commonly used to accelerate the computations. However, homogeneous approaches that only leverage the GPU in the system do not take full advantage of the often powerful CPUs located in modern HPC systems. In this work, we present multi-vendor, heterogeneous implementations of the CG method and the Cholesky decomposition that leverage the CPU and GPU of a heterogeneous system simultaneously using SYCL. Furthermore, we compare their runtime behavior to traditional, homogeneous approaches. The results show that for large matrices, our heterogeneous implementation is up to 32 percent faster for the CG method and up to 29 percent faster for the Cholesky decomposition compared to the corresponding GPU-only implementations. In addition, for large matrices, our heterogeneous implementation of the Cholesky decomposition can achieve at least 12 percent faster runtimes across several systems with GPUs from NVIDIA, AMD, and Intel.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated authors' rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents multi-vendor heterogeneous implementations of the Conjugate Gradient (CG) method and Cholesky decomposition that simultaneously use CPU and GPU via SYCL. It reports empirical runtime comparisons against homogeneous GPU-only baselines, claiming up to 32% speedup for CG and 29% for Cholesky on large matrices, plus at least 12% Cholesky gains across NVIDIA, AMD, and Intel GPUs.

Significance. If the speedups can be substantiated with overhead breakdowns, the work would demonstrate practical benefits of heterogeneous scheduling for large symmetric positive-definite linear systems in HPC, with the SYCL-based multi-vendor portability as a notable strength for reproducibility across hardware.

major comments (2)
  1. [Results] Aggregate timings are presented for the claimed 32% CG and 29% Cholesky speedups, but no breakdown of host-device data movement costs, synchronization stalls, or CPU kernel execution share is provided for the tested matrix sizes. This leaves open whether the heterogeneous schedule outperforms a well-tuned GPU-only baseline or merely benefits from SYCL runtime or baseline differences; the profiling sketch following these comments shows the kind of instrumentation that would settle it.
  2. [Abstract and Results] Concrete percentage speedups are stated without accompanying matrix dimensions, iteration counts, convergence tolerances, or precise baseline kernel details, preventing independent reproduction or assessment of whether the overhead assumption holds for the reported 'large matrices'.
minor comments (1)
  1. [Abstract] The abstract refers to 'several systems' for the 12% Cholesky claim but does not list the specific GPU models or matrix sizes used.
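
The breakdown requested in the first major comment can be produced with standard SYCL 2020 event profiling. A minimal editorial sketch of the idea (not the authors' instrumentation; assumes a GPU device is present and the backend supports queue profiling):

```cpp
// Editorial sketch of a transfer/kernel timing breakdown using standard
// SYCL 2020 event profiling; not the authors' instrumentation.
#include <sycl/sycl.hpp>
#include <cstdio>

// Elapsed device time of a profiled command, in milliseconds.
double ms(const sycl::event& e) {
  auto t0 = e.get_profiling_info<sycl::info::event_profiling::command_start>();
  auto t1 = e.get_profiling_info<sycl::info::event_profiling::command_end>();
  return static_cast<double>(t1 - t0) * 1e-6;   // ns -> ms
}

int main() {
  sycl::queue q{sycl::gpu_selector_v,
                sycl::property::queue::enable_profiling{}};

  constexpr size_t n = 1 << 24;
  float* h = sycl::malloc_host<float>(n, q);
  float* d = sycl::malloc_device<float>(n, q);
  for (size_t i = 0; i < n; ++i) h[i] = 1.0f;

  sycl::event h2d = q.copy(h, d, n);                      // host -> device
  sycl::event krn = q.parallel_for(sycl::range<1>{n}, h2d,
      [=](sycl::id<1> i) { d[i] *= 2.0f; });              // device kernel
  sycl::event d2h = q.copy(d, h, n, krn);                 // device -> host
  d2h.wait();

  std::printf("H2D %.2f ms | kernel %.2f ms | D2H %.2f ms\n",
              ms(h2d), ms(krn), ms(d2h));

  sycl::free(d, q);
  sycl::free(h, q);
}
```

Per-phase numbers of this kind, reported alongside the aggregate speedups, would separate scheduling gains from transfer costs directly.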

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. The comments highlight important aspects for improving the clarity and reproducibility of our results on heterogeneous SYCL solvers. We address each major comment below and plan to revise the manuscript accordingly to strengthen the presentation of our findings.

Point-by-point responses
  1. Referee: [Results] Aggregate timings are presented for the claimed 32% CG and 29% Cholesky speedups, but no breakdown of host-device data movement costs, synchronization stalls, or CPU kernel execution share is provided for the tested matrix sizes, leaving open whether the heterogeneous schedule outperforms a well-tuned GPU-only baseline or merely benefits from SYCL runtime or baseline differences.

    Authors: We agree that a detailed breakdown would enhance the verification of our claims. In the revised manuscript, we will add a new subsection or table in the Results section providing the breakdown of execution times for data movement, CPU kernel execution, and synchronization overheads for the largest matrix sizes. This will demonstrate that the observed speedups stem from effective heterogeneous scheduling rather than implementation artifacts. Our baselines are implemented consistently within the same SYCL framework to ensure comparability. revision: yes

  2. Referee: [Abstract and Results] Concrete percentage speedups are stated without accompanying matrix dimensions, iteration counts, convergence tolerances, or precise baseline kernel details, preventing independent reproduction or assessment of whether the overhead assumption holds for the reported 'large matrices'.

    Authors: We acknowledge the need for these details to support reproducibility. We will revise the abstract and expand the Results section to explicitly state the matrix dimensions used (ranging from small to large, with specifics for the reported speedups), the number of iterations, the convergence tolerance (1e-6), and precise descriptions of the baseline GPU-only kernels. This will allow readers to assess the applicability of our overhead assumptions. revision: yes

Circularity Check

0 steps flagged

Empirical runtime comparison with no derivations or self-referential predictions

Full rationale

The paper reports measured wall-clock times for SYCL-based CG and Cholesky implementations on CPU+GPU versus GPU-only baselines across several matrix sizes and vendors. No equations, fitted parameters, uniqueness theorems, or ansatzes are introduced; the central claims are direct empirical deltas (up to 32% and 29% faster) obtained from benchmark runs. No load-bearing step reduces to a self-citation or to a quantity defined from the same data. The claims therefore rest entirely on independent timing measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or new entities are introduced; the work is an empirical performance study of existing solvers.

pith-pipeline@v0.9.0 · 5515 in / 1098 out tokens · 42580 ms · 2026-05-14T01:59:46.415354+00:00 · methodology

