pith. machine review for the scientific record.

arxiv: 2604.16043 · v1 · submitted 2026-04-17 · 💻 cs.DC · cs.PL

Recognition: unknown

Evaluating SYCL as a Unified Programming Model for Heterogeneous Systems

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 07:58 UTC · model grok-4.3

classification 💻 cs.DC · cs.PL
keywords SYCL · heterogeneous computing · code portability · parallel programming · HPC · memory management · programming models · cross-platform development

The pith

SYCL implementations show inconsistencies in memory management and parallelism that undermine cross-platform reliability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

SYCL is intended as a single-source framework that simplifies programming for heterogeneous systems built from CPUs, GPUs, and accelerators. The paper tests whether it meets its goals of code portability, developer productivity, and runtime efficiency by comparing its memory models and kernel abstractions. Benchmarks on Intel platforms, combined with results synthesized from other studies, reveal behavioral differences between the USM and buffer-accessor approaches and between NDRange and hierarchical kernels. A sympathetic reader would care because these gaps mean developers still face platform-specific tuning even when using a supposedly unified model.

Core claim

The paper claims that SYCL does not deliver consistent cross-platform behavior in its core abstractions: the USM and buffer-accessor memory models produce different results, and the NDRange and hierarchical parallelism models vary in efficiency and correctness across implementations. These gaps expose limitations that affect reliability and usability.

What carries the argument

Direct comparison of SYCL's Unified Shared Memory (USM) versus buffer-accessor memory models and NDRange versus hierarchical kernel models, evaluated through application benchmarks and literature synthesis.
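
To make the memory-model comparison concrete, here is a minimal vector-addition sketch in both styles. This is an editorial reconstruction, not code from the paper; the function names, sizes, and kernel shapes are arbitrary.

```cpp
#include <sycl/sycl.hpp>
#include <vector>

constexpr size_t N = 1024;

// Unified Shared Memory (USM): one pointer is dereferenceable on host and
// device, and the runtime manages migration for shared allocations.
void vector_add_usm(sycl::queue& q) {
  float* a = sycl::malloc_shared<float>(N, q);
  float* b = sycl::malloc_shared<float>(N, q);
  float* c = sycl::malloc_shared<float>(N, q);
  for (size_t i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

  q.parallel_for(sycl::range<1>{N},
                 [=](sycl::id<1> i) { c[i] = a[i] + b[i]; })
      .wait();

  sycl::free(a, q);
  sycl::free(b, q);
  sycl::free(c, q);
}

// Buffer-accessor: data movement and kernel ordering are inferred by the
// runtime from the declared accessors.
void vector_add_buffers(sycl::queue& q) {
  std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N);
  {
    sycl::buffer<float, 1> ba{a.data(), sycl::range<1>{N}};
    sycl::buffer<float, 1> bb{b.data(), sycl::range<1>{N}};
    sycl::buffer<float, 1> bc{c.data(), sycl::range<1>{N}};
    q.submit([&](sycl::handler& h) {
      sycl::accessor pa{ba, h, sycl::read_only};
      sycl::accessor pb{bb, h, sycl::read_only};
      sycl::accessor pc{bc, h, sycl::write_only, sycl::no_init};
      h.parallel_for(sycl::range<1>{N},
                     [=](sycl::id<1> i) { pc[i] = pa[i] + pb[i]; });
    });
  } // buffer destruction synchronizes and writes results back into c
}
```

The two kernel models the paper contrasts can be shown over the same arithmetic. Again a hedged sketch: it assumes the include, the constant N, and USM pointers as above, and that N is divisible by the work-group size.

```cpp
void vector_add_two_kernel_models(sycl::queue& q, const float* a,
                                  const float* b, float* c) {
  constexpr size_t WG = 64; // illustrative work-group size

  // NDRange style: explicit global and local ranges; work-item
  // coordinates come from the nd_item.
  q.parallel_for(sycl::nd_range<1>{sycl::range<1>{N}, sycl::range<1>{WG}},
                 [=](sycl::nd_item<1> it) {
                   size_t i = it.get_global_id(0);
                   c[i] = a[i] + b[i];
                 })
      .wait();

  // Hierarchical style: parallel_for_work_group iterates over work-groups,
  // with parallel_for_work_item nested inside; code between the two scopes
  // executes logically once per work-group.
  q.submit([&](sycl::handler& h) {
     h.parallel_for_work_group(sycl::range<1>{N / WG}, sycl::range<1>{WG},
                               [=](sycl::group<1> g) {
                                 g.parallel_for_work_item(
                                     [&](sycl::h_item<1> it) {
                                       size_t i = it.get_global_id(0);
                                       c[i] = a[i] + b[i];
                                     });
                               });
   }).wait();
}
```

All four variants compute the same result; the paper's point is that their relative performance, and in places their observable behavior, differ across implementations, which is why the side-by-side comparison carries the argument.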

If this is right

  • Developers cannot assume seamless portability and must validate code behavior on each target platform.
  • Standardization efforts need to address variations in memory management to improve consistency.
  • Productivity gains from SYCL are reduced by the need for platform-specific debugging and tuning.
  • Future framework updates could prioritize unified runtime behavior to meet original design goals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These findings suggest that unified models face inherent challenges when supporting diverse hardware without vendor-specific extensions.
  • Similar evaluation methods could be applied to other portable frameworks to identify comparable gaps.
  • Adoption in production HPC workloads may require supplementary tools for detecting implementation differences, as sketched below.
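
One shape such a tool could take: run the same kernel through both memory models and diff the outputs. A hedged sketch, not something the paper builds; exact agreement is the assumed oracle, which is reasonable here because both paths perform identical float operations.

```cpp
#include <sycl/sycl.hpp>
#include <cstdio>
#include <vector>

// Run one multiply-add through USM and through buffers on the same queue,
// then count element-wise mismatches. A real tool would sweep kernels,
// sizes, devices, and implementations; this shows only the skeleton.
int main() {
  constexpr size_t N = 4096;
  sycl::queue q;

  // USM path.
  float* x = sycl::malloc_shared<float>(N, q);
  for (size_t i = 0; i < N; ++i) x[i] = float(i);
  q.parallel_for(sycl::range<1>{N},
                 [=](sycl::id<1> i) { x[i] = x[i] * 2.0f + 1.0f; })
      .wait();

  // Buffer-accessor path.
  std::vector<float> y(N);
  for (size_t i = 0; i < N; ++i) y[i] = float(i);
  {
    sycl::buffer<float, 1> by{y.data(), sycl::range<1>{N}};
    q.submit([&](sycl::handler& h) {
      sycl::accessor ay{by, h, sycl::read_write};
      h.parallel_for(sycl::range<1>{N},
                     [=](sycl::id<1> i) { ay[i] = ay[i] * 2.0f + 1.0f; });
    });
  } // results are written back to y when the buffer is destroyed

  size_t mismatches = 0;
  for (size_t i = 0; i < N; ++i)
    if (x[i] != y[i]) ++mismatches;
  std::printf("%zu mismatches out of %zu elements\n", mismatches, N);

  sycl::free(x, q);
  return mismatches == 0 ? 0 : 1;
}
```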

Load-bearing premise

The chosen benchmarks and illustrative examples are sufficient to represent the full range of real-world HPC application behaviors and SYCL usage patterns.

What would settle it

Running the paper's benchmark suite on additional SYCL implementations such as ComputeCpp and other vendors' hardware would show whether the reported inconsistencies appear consistently or remain limited to the tested Intel platforms.
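
Mechanically, that sweep reduces to enumerating whatever platforms and devices the installed SYCL implementations expose and rerunning the suite on each. A minimal sketch; run_suite is a hypothetical stand-in for the paper's benchmarks, not an API from the paper.

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

// Hypothetical stand-in for the paper's benchmark suite; a real harness
// would launch the USM/buffer and NDRange/hierarchical benchmarks here.
void run_suite(const sycl::device& dev) {
  sycl::queue q{dev};
  // ... launch benchmarks on q and record results ...
}

int main() {
  // Each installed SYCL implementation (DPC++, AdaptiveCpp, ...) surfaces
  // its hardware as platforms; iterating them covers every visible device.
  for (const auto& plat : sycl::platform::get_platforms()) {
    std::cout << "Platform: "
              << plat.get_info<sycl::info::platform::name>() << "\n";
    for (const auto& dev : plat.get_devices()) {
      std::cout << "  Device: "
                << dev.get_info<sycl::info::device::name>() << "\n";
      run_suite(dev);
    }
  }
  return 0;
}
```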

Figures

Figures reproduced from arXiv: 2604.16043 by Ami Marowka.

Figure 1. Reduction using Local Memory and NDRange kernel on … [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] · view at source ↗
Figure 3. Comparing Results of Vector-Addition on the Intel … [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] · view at source ↗
Figure 5. Comparing Results of Matrix Multiplication on the … [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] · view at source ↗
Figure 7. Comparing results of Reduction using Local Memory … [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] · view at source ↗
Figure 10. DGEMM on NVIDIA V100 GPU and Intel Cascade Lake … [PITH_FULL_IMAGE:figures/full_fig_p012_10.png] · view at source ↗
Figure 11. Conjugate Gradient on NVIDIA P100 and AMD … [PITH_FULL_IMAGE:figures/full_fig_p012_11.png] · view at source ↗
Figure 12. The impact of the DPC++ compiler on the performance on … [PITH_FULL_IMAGE:figures/full_fig_p016_12.png] · view at source ↗
read the original abstract

High-performance computing (HPC) applications are increasingly executed in heterogeneous environments, introducing new challenges for programming and software portability. SYCL has emerged as a leading model designed to simplify heterogeneous programming and make it more accessible to developers. Intended as a single-source, cross-platform parallel programming framework, SYCL promises portability, productivity, and performance across a variety of architectures. However, these goals have not been consistently defined or realized, leaving developers with varying expectations. This paper addresses this gap by evaluating SYCL from the perspective of application developers. We analyze whether SYCL meets essential criteria for cross-platform development, including code portability, development productivity, and runtime efficiency. Our evaluation draws on benchmarks and illustrative examples and focuses on SYCL's memory management and parallelism abstractions. We provide detailed comparisons between Unified Shared Memory (USM) and buffer-accessor approaches, as well as between NDRange and hierarchical kernel models. In addition to presenting our own benchmark results on Intel platforms, we synthesize findings from recent studies across multiple SYCL implementations and compilers. Our results expose key limitations and inconsistencies in current SYCL implementations and offer insights into the steps needed to improve the framework's reliability and cross-platform usability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates SYCL as a single-source programming model for heterogeneous HPC systems by comparing USM versus buffer-accessor memory models and NDRange versus hierarchical kernel models. It reports new benchmark results on Intel platforms, synthesizes findings from prior studies across implementations, and concludes that current SYCL implementations exhibit limitations and inconsistencies in portability, productivity, and efficiency that must be addressed for reliable cross-platform use.

Significance. If the empirical comparisons and synthesized observations hold under broader validation, the work would provide concrete, developer-oriented guidance on SYCL's practical shortcomings, helping prioritize improvements in memory abstractions and kernel models that affect real heterogeneous workloads.

major comments (2)
  1. [§4] §4 (Benchmark Methodology) and associated tables/figures: the experimental setup lacks explicit hardware details, compiler versions, run counts, statistical significance tests, or exclusion criteria for outliers, making it impossible to determine whether the reported performance differences and inconsistencies are robust or sensitive to configuration choices.
  2. [Abstract, §5] Abstract and §5 (Discussion/Synthesis): the central claim that the results expose 'key limitations and inconsistencies' generalizable to SYCL relies on the assumption that the selected Intel-focused workloads and illustrative examples are representative of diverse HPC memory access patterns, parallelism granularities, and cross-device behaviors; no justification or coverage argument is provided for this representativeness.
minor comments (2)
  1. [§2] Notation for USM/buffer and NDRange/hierarchical variants is introduced without a consolidated table of abbreviations or a clear mapping to SYCL 2020/2023 specification sections.
  2. [§5] Several synthesized findings from prior studies are cited without page numbers or direct quotation of the original performance numbers, complicating verification.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address the two major comments point by point below and will revise the paper to strengthen the experimental description and the justification of our claims.

read point-by-point responses
  1. Referee: [§4] §4 (Benchmark Methodology) and associated tables/figures: the experimental setup lacks explicit hardware details, compiler versions, run counts, statistical significance tests, or exclusion criteria for outliers, making it impossible to determine whether the reported performance differences and inconsistencies are robust or sensitive to configuration choices.

    Authors: We agree that the experimental setup in §4 is insufficiently detailed. In the revised manuscript we will expand this section to specify the exact hardware platforms (Intel CPU and GPU models), compiler versions and build flags (including the Intel oneAPI DPC++ version), the number of repetitions for each benchmark, the statistical reporting method (means and standard deviations), and the outlier exclusion criteria. These additions will allow readers to evaluate the robustness of the observed differences between USM/buffer and NDRange/hierarchical models; a sketch of this repetition-and-reporting style follows these responses. revision: yes

  2. Referee: [Abstract, §5] Abstract and §5 (Discussion/Synthesis): the central claim that the results expose 'key limitations and inconsistencies' generalizable to SYCL relies on the assumption that the selected Intel-focused workloads and illustrative examples are representative of diverse HPC memory access patterns, parallelism granularities, and cross-device behaviors; no justification or coverage argument is provided for this representativeness.

    Authors: We acknowledge that the manuscript does not provide an explicit coverage argument for the representativeness of the chosen workloads. While the benchmarks illustrate common heterogeneous patterns and we synthesize results from multiple prior studies across SYCL implementations, a dedicated justification is missing. In the revision we will add a short paragraph in §5 that explains how the selected workloads span regular/irregular memory accesses and different parallelism granularities, referencing standard HPC benchmark suites, and we will explicitly state the Intel-centric scope of the new measurements together with the limitations this imposes on generalizability. revision: yes
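
For what the promised reporting style would look like in practice, a minimal repetition-and-statistics harness. This is an editorial sketch, not the authors' code; the kernel, the repetition count, and the absence of outlier filtering are all placeholders.

```cpp
#include <sycl/sycl.hpp>
#include <chrono>
#include <cmath>
#include <iostream>
#include <vector>

// Time one benchmark `reps` times and report mean and sample standard
// deviation. The vector add stands in for any of the paper's kernels.
int main() {
  constexpr size_t N = 1 << 20;
  constexpr int reps = 30; // placeholder repetition count

  sycl::queue q;
  float* a = sycl::malloc_shared<float>(N, q);
  float* b = sycl::malloc_shared<float>(N, q);
  float* c = sycl::malloc_shared<float>(N, q);
  for (size_t i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

  std::vector<double> ms(reps);
  for (int r = 0; r < reps; ++r) {
    auto t0 = std::chrono::steady_clock::now();
    q.parallel_for(sycl::range<1>{N},
                   [=](sycl::id<1> i) { c[i] = a[i] + b[i]; })
        .wait();
    auto t1 = std::chrono::steady_clock::now();
    ms[r] = std::chrono::duration<double, std::milli>(t1 - t0).count();
  }
  // Note: early repetitions typically include JIT compilation; a real
  // harness would warm up first and state its outlier-exclusion rule.

  double mean = 0.0;
  for (double v : ms) mean += v;
  mean /= reps;
  double var = 0.0;
  for (double v : ms) var += (v - mean) * (v - mean);
  double stddev = std::sqrt(var / (reps - 1));

  std::cout << "mean " << mean << " ms, stddev " << stddev << " ms over "
            << reps << " runs\n";

  sycl::free(a, q);
  sycl::free(b, q);
  sycl::free(c, q);
  return 0;
}
```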

Circularity Check

0 steps flagged

No circularity: empirical evaluation without derivations or self-referential reductions

full rationale

The paper is an empirical evaluation of SYCL using benchmarks, illustrative examples, and synthesis of recent studies. It contains no mathematical derivations, equations, parameter fitting, predictions, or uniqueness theorems. Claims about limitations in portability, productivity, and efficiency are supported by reported benchmark results on Intel platforms and external literature synthesis, without reducing to self-definition or fitted inputs by construction. No load-bearing self-citations of the enumerated kinds appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work is purely empirical evaluation of an existing programming model.

pith-pipeline@v0.9.0 · 5500 in / 1030 out tokens · 31044 ms · 2026-05-10T07:58:19.205020+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

29 extracted references · 5 canonical work pages

  1. [1] Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One, 2017 May 11.

  2. [2] The Khronos SYCL Working Group, SYCL 2020 Specification (revision 7), https://www.khronos.org/registry/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf

  3. [3] The Khronos SYCL Working Group, SYCL 2020 Specification (revision 10), https://registry.khronos.org/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf

  4. [4] Khronos Group landing page, https://www.khronos.org/sycl/

  5. [5] Codeplay Software Ltd, ComputeCpp Community Edition, 2021, https://developer.codeplay.com

  6. [6] Aksel Alpay, AdaptiveCpp, 2021, https://github.com/AdaptiveCpp

  7. [7] Intel oneAPI DPC++/C++ Compiler (DPC++), 2020. [Online]. Available: https://github.com/intel/llvm

  8. [8] Reinders, J., Ashbaugh, B., Brodman, J., Kinsner, M., Pennycook, J., and Tian, X. (2020). Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL. Apress.

  9. [9] Reinders, J., Ashbaugh, B., Brodman, J., Kinsner, M., Pennycook, J., and Tian, X. (2023). Data Parallel C++: Programming Accelerated Systems Using C++ and SYCL. Apress.

  10. [10] Nozal, R., Bosque, J.L. (2021). Exploiting Co-execution with OneAPI: Heterogeneity from a Modern Perspective. In: Sousa, L., Roma, N., Tomas, P. (eds), Euro-Par 2021: Parallel Processing.

  11. [11] Euro-Par 2021: Parallel Processing. Lecture Notes in Computer Science, vol 12820. Springer, Cham.

  12. [12] S. Chien, I. Peng and S. Markidis, "Performance Evaluation of Advanced Features in CUDA Unified Memory," IEEE/ACM Workshop on Memory Centric High Performance Computing (MCHPC), Denver, CO, USA, 2019, pp. 50-57.

  13. [13] Z. Jin and J. S. Vetter, "Evaluating Unified Memory Performance in HIP," 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Lyon, France, 2022, pp. 562-568.

  14. [14] S. Joube et al., 2023, J. Phys.: Conf. Ser. 2438, 012018.

  15. [15] Jarzabek, L., Czarnul, P., "Performance evaluation of unified memory and dynamic parallelism for selected parallel CUDA applications," Journal of Supercomputing 73, 5378-5401, 2017.

  16. [16] Gu, Y., Wu, W., Li, Y., and Chen, L., 2020. "UVM-Bench: A Comprehensive Benchmark Suite for Researching Unified Virtual Memory in GPUs," arXiv preprint arXiv:2007.09822.

  17. [17] N. Mijic and D. Davidovic, "Benchmark DPC++ code and performance portability on heterogeneous architectures," 2023 46th MIPRO ICT and Electronics Convention (MIPRO), Opatija, Croatia, 2023, pp. 331-337, doi: 10.23919/MIPRO57284.2023.10159832.

  18. [18] Marcel Breyer, Alexander Van Craen, and Dirk Pfluger. 2023. "Performance Evolution of Different SYCL Implementations based on the Parallel Least Squares Support Vector Machine Library." In International Workshop on OpenCL (IWOCL '23), April 18-20, 2023, Cambridge, United Kingdom.

  19. [19] Wei-Chen Lin, Tom Deakin, and Simon McIntosh-Smith. 2021. "On measuring the maturity of SYCL implementations by tracking historical performance improvements." In Workshop on OpenCL, 1-13.

  20. [20] Juan Fumero, "Overall Performance of Unified Shared Memory Types with Level Zero on Intel Integrated GPUs," 2022, https://jjfumero.github.io/posts/2022/05/overall-performance-of-unified-shared-memory-level-zero/

  21. [21] Intel Developer Cloud, www.intel.com/content/www/us/en/developer/

  22. [22] T. Deakin, S. McIntosh-Smith, A. Alpay and V. Heuveline, "Benchmarking and Extending SYCL Hierarchical Parallelism," 2021 IEEE/ACM International Workshop on Hierarchical Parallelism for Exascale Computing (HiPar), St. Louis, MO, USA, 2021, pp. 10-19.

  23. [23] Marcel Breyer, Alexander Van Craen, and Dirk Pfluger. 2022. "A Comparison of SYCL, OpenCL, CUDA, and OpenMP for Massively Parallel Support Vector Machine Classification on Multi-Vendor Hardware." In International Workshop on OpenCL (IWOCL '22). Association for Computing Machinery, New York, NY, USA, Article 2, 1-12.

  24. [24] Aksel Alpay, Balint Soproni, Holger Wunsche, and Vincent Heuveline. 2022. "Exploring the possibility of a hipSYCL-based implementation of oneAPI." In International Workshop on OpenCL (IWOCL '22), May 10-12, 2022, Bristol, United Kingdom. ACM, New York, NY, USA, 12 pages.

  25. [25] Z. Jin and J. S. Vetter, "A Benchmark Suite for Improving Performance Portability of the SYCL Programming Model," 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Raleigh, NC, USA, 2023, pp. 325-327, doi: 10.1109/ISPASS57527.2023.00041. arXiv: 2303.14006.

  26. [26] Jeff R. Hammond and Timothy G. Mattson. 2019. "Evaluating Data Parallelism in C++ Using the Parallel Research Kernels." In Proceedings of the International Workshop on OpenCL (Boston, MA, USA) (IWOCL '19). Association for Computing Machinery, New York, NY, USA, Article 14, 6 pages.

  27. [27] Marcel Breyer, Alexander Van Craen, and Dirk Pfluger. 2024. "Performance Evaluation of SYCL's Different Data Parallel Kernels." In Proceedings of the 12th International Workshop on OpenCL and SYCL (IWOCL '24). Association for Computing Machinery, New York, NY, USA, Article 10, 1-4.

  28. [28] Meyer, J., Alpay, A., Hack, S., Froning, H., and Heuveline, V. "Implementation Techniques for SPMD Kernels on CPUs." In International Workshop on OpenCL (IWOCL '23), ACM, 2023, https://doi.org/10.1145/3585341.3585342.

  29. [29] A. Marowka, "On the Singularity of SYCL," 2025 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Milano, Italy, 2025, pp. 913-922, doi: 10.1109/IPDPSW66978.2025.00142.