pith. machine review for the scientific record.

arxiv: 2604.16043 · v1 · submitted 2026-04-17 · 💻 cs.DC · cs.PL

Recognition: unknown

Evaluating SYCL as a Unified Programming Model for Heterogeneous Systems

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 07:58 UTC · model grok-4.3

classification 💻 cs.DC · cs.PL
keywords SYCL · heterogeneous computing · code portability · parallel programming · HPC · memory management · programming models · cross-platform development

The pith

SYCL implementations show inconsistencies in memory management and parallelism that undermine cross-platform reliability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

SYCL is intended as a single-source framework that simplifies programming for heterogeneous systems built from CPUs, GPUs, and accelerators. The paper tests whether it meets its goals of code portability, developer productivity, and runtime efficiency by comparing its memory models and kernel abstractions. Benchmarks on Intel platforms, combined with results synthesized from other studies, reveal behavioral differences between the USM and buffer-accessor approaches and between NDRange and hierarchical kernels. A sympathetic reader would care because these gaps mean developers still face platform-specific tuning even when using a supposedly unified model.

Core claim

The paper claims that SYCL does not deliver consistent cross-platform behavior in its core abstractions: the USM and buffer-accessor memory models produce different results, and the NDRange and hierarchical parallelism models vary in efficiency and correctness across implementations. These gaps expose limitations that affect reliability and usability.

What carries the argument

Direct comparison of SYCL's Unified Shared Memory (USM) versus buffer-accessor memory models and NDRange versus hierarchical kernel models, evaluated through application benchmarks and literature synthesis.
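
To make the memory-model comparison concrete, here is a minimal vector-addition sketch in both styles. This is an editorial reconstruction, not code from the paper; the function names, sizes, and kernel shapes are arbitrary.

```cpp
#include <sycl/sycl.hpp>
#include <vector>

constexpr size_t N = 1024;

// Unified Shared Memory (USM): one pointer is dereferenceable on host and
// device, and the runtime manages migration for shared allocations.
void vector_add_usm(sycl::queue& q) {
  float* a = sycl::malloc_shared<float>(N, q);
  float* b = sycl::malloc_shared<float>(N, q);
  float* c = sycl::malloc_shared<float>(N, q);
  for (size_t i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

  q.parallel_for(sycl::range<1>{N},
                 [=](sycl::id<1> i) { c[i] = a[i] + b[i]; })
      .wait();

  sycl::free(a, q);
  sycl::free(b, q);
  sycl::free(c, q);
}

// Buffer-accessor: data movement and kernel ordering are inferred by the
// runtime from the declared accessors.
void vector_add_buffers(sycl::queue& q) {
  std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N);
  {
    sycl::buffer<float, 1> ba{a.data(), sycl::range<1>{N}};
    sycl::buffer<float, 1> bb{b.data(), sycl::range<1>{N}};
    sycl::buffer<float, 1> bc{c.data(), sycl::range<1>{N}};
    q.submit([&](sycl::handler& h) {
      sycl::accessor pa{ba, h, sycl::read_only};
      sycl::accessor pb{bb, h, sycl::read_only};
      sycl::accessor pc{bc, h, sycl::write_only, sycl::no_init};
      h.parallel_for(sycl::range<1>{N},
                     [=](sycl::id<1> i) { pc[i] = pa[i] + pb[i]; });
    });
  } // buffer destruction synchronizes and writes results back into c
}
```

The two kernel models the paper contrasts can be shown over the same arithmetic. Again a hedged sketch: it assumes the include, the constant N, and USM pointers as above, and that N is divisible by the work-group size.

```cpp
void vector_add_two_kernel_models(sycl::queue& q, const float* a,
                                  const float* b, float* c) {
  constexpr size_t WG = 64; // illustrative work-group size

  // NDRange style: explicit global and local ranges; work-item
  // coordinates come from the nd_item.
  q.parallel_for(sycl::nd_range<1>{sycl::range<1>{N}, sycl::range<1>{WG}},
                 [=](sycl::nd_item<1> it) {
                   size_t i = it.get_global_id(0);
                   c[i] = a[i] + b[i];
                 })
      .wait();

  // Hierarchical style: parallel_for_work_group iterates over work-groups,
  // with parallel_for_work_item nested inside; code between the two scopes
  // executes logically once per work-group.
  q.submit([&](sycl::handler& h) {
     h.parallel_for_work_group(sycl::range<1>{N / WG}, sycl::range<1>{WG},
                               [=](sycl::group<1> g) {
                                 g.parallel_for_work_item(
                                     [&](sycl::h_item<1> it) {
                                       size_t i = it.get_global_id(0);
                                       c[i] = a[i] + b[i];
                                     });
                               });
   }).wait();
}
```

All four variants compute the same result; the paper's point is that their relative performance, and in places their observable behavior, differ across implementations, which is why the side-by-side comparison carries the argument.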

If this is right

  • Developers cannot assume seamless portability and must validate code behavior on each target platform.
  • Standardization efforts need to address variations in memory management to improve consistency.
  • Productivity gains from SYCL are reduced by the need for platform-specific debugging and tuning.
  • Future framework updates could prioritize unified runtime behavior to meet original design goals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These findings suggest that unified models face inherent challenges when supporting diverse hardware without vendor-specific extensions.
  • Similar evaluation methods could be applied to other portable frameworks to identify comparable gaps.
  • Adoption in production HPC workloads may require supplementary tools for detecting implementation differences, as sketched below.
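
One shape such a tool could take: run the same kernel through both memory models and diff the outputs. A hedged sketch, not something the paper builds; exact agreement is the assumed oracle, which is reasonable here because both paths perform identical float operations.

```cpp
#include <sycl/sycl.hpp>
#include <cstdio>
#include <vector>

// Run one multiply-add through USM and through buffers on the same queue,
// then count element-wise mismatches. A real tool would sweep kernels,
// sizes, devices, and implementations; this shows only the skeleton.
int main() {
  constexpr size_t N = 4096;
  sycl::queue q;

  // USM path.
  float* x = sycl::malloc_shared<float>(N, q);
  for (size_t i = 0; i < N; ++i) x[i] = float(i);
  q.parallel_for(sycl::range<1>{N},
                 [=](sycl::id<1> i) { x[i] = x[i] * 2.0f + 1.0f; })
      .wait();

  // Buffer-accessor path.
  std::vector<float> y(N);
  for (size_t i = 0; i < N; ++i) y[i] = float(i);
  {
    sycl::buffer<float, 1> by{y.data(), sycl::range<1>{N}};
    q.submit([&](sycl::handler& h) {
      sycl::accessor ay{by, h, sycl::read_write};
      h.parallel_for(sycl::range<1>{N},
                     [=](sycl::id<1> i) { ay[i] = ay[i] * 2.0f + 1.0f; });
    });
  } // results are written back to y when the buffer is destroyed

  size_t mismatches = 0;
  for (size_t i = 0; i < N; ++i)
    if (x[i] != y[i]) ++mismatches;
  std::printf("%zu mismatches out of %zu elements\n", mismatches, N);

  sycl::free(x, q);
  return mismatches == 0 ? 0 : 1;
}
```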

Load-bearing premise

The chosen benchmarks and illustrative examples are sufficient to represent the full range of real-world HPC application behaviors and SYCL usage patterns.

What would settle it

Running the paper's benchmark suite on additional SYCL implementations such as ComputeCpp and other vendors' hardware would show whether the reported inconsistencies appear consistently or remain limited to the tested Intel platforms.
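
Mechanically, that sweep reduces to enumerating whatever platforms and devices the installed SYCL implementations expose and rerunning the suite on each. A minimal sketch; run_suite is a hypothetical stand-in for the paper's benchmarks, not an API from the paper.

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

// Hypothetical stand-in for the paper's benchmark suite; a real harness
// would launch the USM/buffer and NDRange/hierarchical benchmarks here.
void run_suite(const sycl::device& dev) {
  sycl::queue q{dev};
  // ... launch benchmarks on q and record results ...
}

int main() {
  // Each installed SYCL implementation (DPC++, AdaptiveCpp, ...) surfaces
  // its hardware as platforms; iterating them covers every visible device.
  for (const auto& plat : sycl::platform::get_platforms()) {
    std::cout << "Platform: "
              << plat.get_info<sycl::info::platform::name>() << "\n";
    for (const auto& dev : plat.get_devices()) {
      std::cout << "  Device: "
                << dev.get_info<sycl::info::device::name>() << "\n";
      run_suite(dev);
    }
  }
  return 0;
}
```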

Figures

Figures reproduced from arXiv: 2604.16043 by Ami Marowka.

Figure 1. Reduction using Local Memory and NDRange kernel on … [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] · view at source ↗
Figure 3. Comparing Results of Vector-Addition on the Intel … [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] · view at source ↗
Figure 5. Comparing Results of Matrix Multiplication on the … [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] · view at source ↗
Figure 7. Comparing results of Reduction using Local Memory … [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] · view at source ↗
Figure 10. DGEMM on NVIDIA V100 GPU and Intel Cascade Lake … [PITH_FULL_IMAGE:figures/full_fig_p012_10.png] · view at source ↗
Figure 11. Conjugate Gradient on NVIDIA P100 and AMD … [PITH_FULL_IMAGE:figures/full_fig_p012_11.png] · view at source ↗
Figure 12. The impact of the DPC++ compiler on the performance on … [PITH_FULL_IMAGE:figures/full_fig_p016_12.png] · view at source ↗
read the original abstract

High-performance computing (HPC) applications are increasingly executed in heterogeneous environments, introducing new challenges for programming and software portability. SYCL has emerged as a leading model designed to simplify heterogeneous programming and make it more accessible to developers. Intended as a single-source, cross-platform parallel programming framework, SYCL promises portability, productivity, and performance across a variety of architectures. However, these goals have not been consistently defined or realized, leaving developers with varying expectations. This paper addresses this gap by evaluating SYCL from the perspective of application developers. We analyze whether SYCL meets essential criteria for cross-platform development, including code portability, development productivity, and runtime efficiency. Our evaluation draws on benchmarks and illustrative examples and focuses on SYCL's memory management and parallelism abstractions. We provide detailed comparisons between Unified Shared Memory (USM) and buffer-accessor approaches, as well as between NDRange and hierarchical kernel models. In addition to presenting our own benchmark results on Intel platforms, we synthesize findings from recent studies across multiple SYCL implementations and compilers. Our results expose key limitations and inconsistencies in current SYCL implementations and offer insights into the steps needed to improve the framework's reliability and cross-platform usability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates SYCL as a single-source programming model for heterogeneous HPC systems by comparing USM versus buffer-accessor memory models and NDRange versus hierarchical kernel models. It reports new benchmark results on Intel platforms, synthesizes findings from prior studies across implementations, and concludes that current SYCL implementations exhibit limitations and inconsistencies in portability, productivity, and efficiency that must be addressed for reliable cross-platform use.

Significance. If the empirical comparisons and synthesized observations hold under broader validation, the work would provide concrete, developer-oriented guidance on SYCL's practical shortcomings, helping prioritize improvements in memory abstractions and kernel models that affect real heterogeneous workloads.

major comments (2)
  1. [§4] §4 (Benchmark Methodology) and associated tables/figures: the experimental setup lacks explicit hardware details, compiler versions, run counts, statistical significance tests, or exclusion criteria for outliers, making it impossible to determine whether the reported performance differences and inconsistencies are robust or sensitive to configuration choices.
  2. [Abstract, §5] Abstract and §5 (Discussion/Synthesis): the central claim that the results expose 'key limitations and inconsistencies' generalizable to SYCL relies on the assumption that the selected Intel-focused workloads and illustrative examples are representative of diverse HPC memory access patterns, parallelism granularities, and cross-device behaviors; no justification or coverage argument is provided for this representativeness.
minor comments (2)
  1. [§2] Notation for USM/buffer and NDRange/hierarchical variants is introduced without a consolidated table of abbreviations or a clear mapping to SYCL 2020/2023 specification sections.
  2. [§5] Several synthesized findings from prior studies are cited without page numbers or direct quotation of the original performance numbers, complicating verification.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address the two major comments point by point below and will revise the paper to strengthen the experimental description and the justification of our claims.

read point-by-point responses
  1. Referee: [§4] §4 (Benchmark Methodology) and associated tables/figures: the experimental setup lacks explicit hardware details, compiler versions, run counts, statistical significance tests, or exclusion criteria for outliers, making it impossible to determine whether the reported performance differences and inconsistencies are robust or sensitive to configuration choices.

    Authors: We agree that the experimental setup in §4 is insufficiently detailed. In the revised manuscript we will expand this section to specify the exact hardware platforms (Intel CPU and GPU models), compiler versions and build flags (including the Intel oneAPI DPC++ version), the number of repetitions for each benchmark, the statistical reporting method (means and standard deviations), and the outlier exclusion criteria. These additions will allow readers to evaluate the robustness of the observed differences between USM/buffer and NDRange/hierarchical models; a sketch of this repetition-and-reporting style follows these responses. revision: yes

  2. Referee: [Abstract, §5] Abstract and §5 (Discussion/Synthesis): the central claim that the results expose 'key limitations and inconsistencies' generalizable to SYCL relies on the assumption that the selected Intel-focused workloads and illustrative examples are representative of diverse HPC memory access patterns, parallelism granularities, and cross-device behaviors; no justification or coverage argument is provided for this representativeness.

    Authors: We acknowledge that the manuscript does not provide an explicit coverage argument for the representativeness of the chosen workloads. While the benchmarks illustrate common heterogeneous patterns and we synthesize results from multiple prior studies across SYCL implementations, a dedicated justification is missing. In the revision we will add a short paragraph in §5 that explains how the selected workloads span regular/irregular memory accesses and different parallelism granularities, referencing standard HPC benchmark suites, and we will explicitly state the Intel-centric scope of the new measurements together with the limitations this imposes on generalizability. revision: yes
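
For what the promised reporting style would look like in practice, a minimal repetition-and-statistics harness. This is an editorial sketch, not the authors' code; the kernel, the repetition count, and the absence of outlier filtering are all placeholders.

```cpp
#include <sycl/sycl.hpp>
#include <chrono>
#include <cmath>
#include <iostream>
#include <vector>

// Time one benchmark `reps` times and report mean and sample standard
// deviation. The vector add stands in for any of the paper's kernels.
int main() {
  constexpr size_t N = 1 << 20;
  constexpr int reps = 30; // placeholder repetition count

  sycl::queue q;
  float* a = sycl::malloc_shared<float>(N, q);
  float* b = sycl::malloc_shared<float>(N, q);
  float* c = sycl::malloc_shared<float>(N, q);
  for (size_t i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

  std::vector<double> ms(reps);
  for (int r = 0; r < reps; ++r) {
    auto t0 = std::chrono::steady_clock::now();
    q.parallel_for(sycl::range<1>{N},
                   [=](sycl::id<1> i) { c[i] = a[i] + b[i]; })
        .wait();
    auto t1 = std::chrono::steady_clock::now();
    ms[r] = std::chrono::duration<double, std::milli>(t1 - t0).count();
  }
  // Note: early repetitions typically include JIT compilation; a real
  // harness would warm up first and state its outlier-exclusion rule.

  double mean = 0.0;
  for (double v : ms) mean += v;
  mean /= reps;
  double var = 0.0;
  for (double v : ms) var += (v - mean) * (v - mean);
  double stddev = std::sqrt(var / (reps - 1));

  std::cout << "mean " << mean << " ms, stddev " << stddev << " ms over "
            << reps << " runs\n";

  sycl::free(a, q);
  sycl::free(b, q);
  sycl::free(c, q);
  return 0;
}
```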

Circularity Check

0 steps flagged

No circularity: empirical evaluation without derivations or self-referential reductions

full rationale

The paper is an empirical evaluation of SYCL using benchmarks, illustrative examples, and synthesis of recent studies. It contains no mathematical derivations, equations, parameter fitting, predictions, or uniqueness theorems. Claims about limitations in portability, productivity, and efficiency are supported by reported benchmark results on Intel platforms and external literature synthesis, without reducing to self-definition or fitted inputs by construction. No load-bearing self-citations of the enumerated kinds appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work is purely empirical evaluation of an existing programming model.

pith-pipeline@v0.9.0 · 5500 in / 1030 out tokens · 31044 ms · 2026-05-10T07:58:19.205020+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

29 extracted references · 5 canonical work pages

  1. [1] Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One, 2017 May 11.

  2. [2] The Khronos SYCL Working Group, SYCL 2020 Specification (revision 7), https://www.khronos.org/registry/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf

  3. [3] The Khronos SYCL Working Group, SYCL 2020 Specification (revision 10), https://registry.khronos.org/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf

  4. [4] Khronos Group landing page, https://www.khronos.org/sycl/

  5. [5] Codeplay Software Ltd, ComputeCpp Community Edition, 2021, https://developer.codeplay.com

  6. [6] Aksel Alpay, AdaptiveCpp, 2021, https://github.com/AdaptiveCpp

  7. [7] Intel oneAPI DPC++/C++ Compiler (DPC++), 2020. [Online]. Available: https://github.com/intel/llvm

  8. [8] Reinders, J., Ashbaugh, B., Brodman, J., Kinsner, M., Pennycook, J., and Tian, X. (2020). Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL. Apress.

  9. [9] Reinders, J., Ashbaugh, B., Brodman, J., Kinsner, M., Pennycook, J., and Tian, X. (2023). Data Parallel C++: Programming Accelerated Systems Using C++ and SYCL. Apress.

  10. [10] Nozal, R., Bosque, J.L. (2021). Exploiting Co-execution with OneAPI: Heterogeneity from a Modern Perspective. In: Sousa, L., Roma, N., Tomas, P. (eds), Euro-Par 2021: Parallel Processing.

  11. [11] Euro-Par 2021: Parallel Processing. Lecture Notes in Computer Science, vol 12820. Springer, Cham.

  12. [12] S. Chien, I. Peng and S. Markidis, "Performance Evaluation of Advanced Features in CUDA Unified Memory," IEEE/ACM Workshop on Memory Centric High Performance Computing (MCHPC), Denver, CO, USA, 2019, pp. 50-57.

  13. [13] Z. Jin and J. S. Vetter, "Evaluating Unified Memory Performance in HIP," 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Lyon, France, 2022, pp. 562-568.

  14. [14] S. Joube et al., 2023, J. Phys.: Conf. Ser. 2438, 012018.

  15. [15] Jarzabek, L., Czarnul, P., "Performance evaluation of unified memory and dynamic parallelism for selected parallel CUDA applications," Journal of Supercomputing 73, 5378-5401, 2017.

  16. [16] Gu, Y., Wu, W., Li, Y., and Chen, L., 2020. "UVM-Bench: A Comprehensive Benchmark Suite for Researching Unified Virtual Memory in GPUs," arXiv preprint arXiv:2007.09822.

  17. [17] N. Mijic and D. Davidovic, "Benchmark DPC++ code and performance portability on heterogeneous architectures," 2023 46th MIPRO ICT and Electronics Convention (MIPRO), Opatija, Croatia, 2023, pp. 331-337, doi: 10.23919/MIPRO57284.2023.10159832.

  18. [18] Marcel Breyer, Alexander Van Craen, and Dirk Pfluger. 2023. "Performance Evolution of Different SYCL Implementations based on the Parallel Least Squares Support Vector Machine Library." In International Workshop on OpenCL (IWOCL '23), April 18-20, 2023, Cambridge, United Kingdom.

  19. [19] Wei-Chen Lin, Tom Deakin, and Simon McIntosh-Smith. 2021. "On measuring the maturity of SYCL implementations by tracking historical performance improvements." In Workshop on OpenCL, 1-13.

  20. [20] Juan Fumero, "Overall Performance of Unified Shared Memory Types with Level Zero on Intel Integrated GPUs," 2022, https://jjfumero.github.io/posts/2022/05/overall-performance-of-unified-shared-memory-level-zero/

  21. [21] Intel Developer Cloud, www.intel.com/content/www/us/en/developer/

  22. [22] T. Deakin, S. McIntosh-Smith, A. Alpay and V. Heuveline, "Benchmarking and Extending SYCL Hierarchical Parallelism," 2021 IEEE/ACM International Workshop on Hierarchical Parallelism for Exascale Computing (HiPar), St. Louis, MO, USA, 2021, pp. 10-19.

  23. [23] Marcel Breyer, Alexander Van Craen, and Dirk Pfluger. 2022. "A Comparison of SYCL, OpenCL, CUDA, and OpenMP for Massively Parallel Support Vector Machine Classification on Multi-Vendor Hardware." In International Workshop on OpenCL (IWOCL '22). Association for Computing Machinery, New York, NY, USA, Article 2, 1-12.

  24. [24] Aksel Alpay, Balint Soproni, Holger Wunsche, and Vincent Heuveline. 2022. "Exploring the possibility of a hipSYCL-based implementation of oneAPI." In International Workshop on OpenCL (IWOCL '22), May 10-12, 2022, Bristol, United Kingdom. ACM, New York, NY, USA, 12 pages.

  25. [25] Z. Jin and J. S. Vetter, "A Benchmark Suite for Improving Performance Portability of the SYCL Programming Model," 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Raleigh, NC, USA, 2023, pp. 325-327, doi: 10.1109/ISPASS57527.2023.00041. arXiv: 2303.14006.

  26. [26] Jeff R. Hammond and Timothy G. Mattson. 2019. "Evaluating Data Parallelism in C++ Using the Parallel Research Kernels." In Proceedings of the International Workshop on OpenCL (Boston, MA, USA) (IWOCL '19). Association for Computing Machinery, New York, NY, USA, Article 14, 6 pages.

  27. [27] Marcel Breyer, Alexander Van Craen, and Dirk Pfluger. 2024. "Performance Evaluation of SYCL's Different Data Parallel Kernels." In Proceedings of the 12th International Workshop on OpenCL and SYCL (IWOCL '24). Association for Computing Machinery, New York, NY, USA, Article 10, 1-4.

  28. [28] Meyer, J., Alpay, A., Hack, S., Froning, H., and Heuveline, V. "Implementation Techniques for SPMD Kernels on CPUs." In International Workshop on OpenCL (IWOCL '23), ACM, 2023, https://doi.org/10.1145/3585341.3585342.

  29. [29] A. Marowka, "On the Singularity of SYCL," 2025 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Milano, Italy, 2025, pp. 913-922, doi: 10.1109/IPDPSW66978.2025.00142.