
arxiv: 2606.06697 · v1 · pith:LK7PSMNMnew · submitted 2026-06-04 · 💻 cs.CR · cs.OS

AgileOS: A GPU Operating System Layer for Protected CUDA Services

Pith reviewed 2026-06-28 00:14 UTC · model grok-4.3

classification 💻 cs.CR cs.OS
keywords CUDA virtualization · GPU security · operating system layer · PTX injection · memory protection · trusted execution · library interception

The pith

AgileOS virtualizes CUDA at the library boundary so a trusted worker can mediate access and protect service state from applications.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AgileOS, a GPU operating-system layer that adds protection around CUDA services. Modern applications need isolation for device queues, memory-mapped I/O regions, and library-internal state, but the default CUDA model gives each application direct ownership of its context, device pointers, and kernel launches. AgileOS provides that isolation by having applications link against client-side shims while a trusted worker owns the real CUDA context and separates user memory from protected ranges using pointer validation and PTX-injected guards. Protected services and existing libraries can then rely on one shared mechanism instead of each building its own ad hoc interface.

Core claim

AgileOS virtualizes CUDA at the library boundary: applications link against client-side CUDA Runtime, Driver, and selected library shims, while a trusted runtime worker owns the real CUDA context and mediates supported operations. To protect service state and module interfaces, AgileOS defines a GPU memory-management model that separates user allocations from protected module/MMIO ranges, using pointer validation and memory access guards via PTX injection.
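
As an illustration only, the division of labor in the core claim can be modeled in a few lines of Python. The sketch assumes the worker hands out opaque handles and keeps real device addresses in a private table. The names `Worker`, `ClientShim`, `cuda_malloc`, and `cuda_free` are hypothetical, the address arithmetic stands in for a real allocator, and a direct method call stands in for whatever transport AgileOS uses between client and worker.

```python
from itertools import count


class Worker:
    """Hypothetical trusted worker: the only holder of real device addresses."""

    def __init__(self, base_addr=0x7000_0000):
        self._handles = count(1)      # source of opaque handle values
        self._table = {}              # handle -> (real address, size)
        self._next_addr = base_addr   # stand-in for a real device allocator

    def malloc(self, size):
        if size <= 0:
            raise ValueError("size must be positive")
        handle = next(self._handles)
        self._table[handle] = (self._next_addr, size)
        self._next_addr += size
        return handle                 # the real address never leaves the worker

    def free(self, handle):
        if handle not in self._table:
            raise KeyError(f"unknown handle {handle}")
        del self._table[handle]

    def resolve(self, handle):
        """Worker-internal lookup used when an operation is mediated."""
        return self._table[handle]


class ClientShim:
    """Hypothetical client-side shim: forwards calls, stores only handles."""

    def __init__(self, worker):
        self._worker = worker

    def cuda_malloc(self, size):
        return self._worker.malloc(size)

    def cuda_free(self, handle):
        self._worker.free(handle)
```

In this model the client only ever holds the integer handle. Resolving it to an address happens on the worker side, which is where a mediation check can be placed.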

What carries the argument

The library-boundary virtualization with a trusted runtime worker owning the CUDA context and PTX-injected memory access guards for separating user and protected allocations.

If this is right

  • Applications can interact with protected GPU services without direct access to context or device pointers.
  • Library shims enable compatibility with existing code while routing calls to the worker.
  • The memory model prevents exposure of protected ranges to untrusted kernels.
  • Modular design supports various services and libraries such as cuFFT and PyTorch.
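
The pointer-validation half of the memory model can be pictured as an interval check. The following Python sketch is an assumption about how such a check might look, not the paper's implementation: `Range`, `validate_user_pointer`, and the example addresses are invented for illustration. It accepts an access only if it lies entirely inside one user allocation and touches no protected range.

```python
from typing import NamedTuple


class Range(NamedTuple):
    """Half-open address interval [base, base + size)."""
    base: int
    size: int

    @property
    def end(self):
        return self.base + self.size


def overlaps(a, b):
    """True if the two half-open intervals share at least one byte."""
    return a.base < b.end and b.base < a.end


def validate_user_pointer(ptr, size, user_ranges, protected_ranges):
    """Accept an access only if it stays inside a single user allocation
    and does not touch any protected module/MMIO range."""
    if ptr < 0 or size <= 0:
        return False
    access = Range(ptr, size)
    if any(overlaps(access, p) for p in protected_ranges):
        return False
    return any(u.base <= access.base and access.end <= u.end
               for u in user_ranges)


# Example layout (hypothetical addresses): one user buffer directly
# followed by one protected range.
user = [Range(0x1000, 0x1000)]
protected = [Range(0x2000, 0x100)]
print(validate_user_pointer(0x1000, 0x1000, user, protected))  # whole user buffer
print(validate_user_pointer(0x1F00, 0x200, user, protected))   # spills into protected
```

Checking the full `[ptr, ptr + size)` interval rather than `ptr` alone matters here: the second call starts inside the user buffer but ends inside the protected range.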

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This design could enable more secure sharing of GPU resources in multi-user environments.
  • PTX injection might be adaptable for other forms of runtime enforcement in GPU code.
  • Future extensions could include support for additional device interactions like storage or networking.

Load-bearing premise

That applications cannot bypass the client-side shims to access the real CUDA context directly, and that PTX-level guards can enforce memory separation without breaking library compatibility.
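
To make the second half of the premise concrete, here is a toy PTX rewriter in Python. It is a sketch under stated assumptions rather than AgileOS's guard: the register names `%agile_a` and `%agile_p`, the function `inject_guards`, and the single `[lo, hi)` user window are hypothetical. It inserts a range check and a `trap` before each `ld.global` or `st.global` whose address operand has the form `[%reg]` or `[%reg+imm]`.

```python
import re

# ld.global / st.global (optionally predicated) with a [%reg] or
# [%reg+imm] address operand. Other addressing forms are not handled.
MEM_OP = re.compile(
    r"^\s*(?:@!?%\w+\s+)?(ld|st)\.global\b[^\[]*\[(%\w+)(?:\+(-?\d+))?\]"
)


def inject_guards(ptx, lo, hi):
    """Return PTX text with a bounds check before each matched access.

    Assumption: every user allocation lives in the window [lo, hi).
    """
    out = []
    for line in ptx.splitlines():
        m = MEM_OP.match(line)
        if m:
            reg, off = m.group(2), m.group(3)
            addr = reg
            if off is not None:
                # Materialize base + offset so the check sees the real address.
                out.append(f"    add.s64 %agile_a, {reg}, {off};")
                addr = "%agile_a"
            out.append(f"    setp.lt.u64 %agile_p, {addr}, {lo};")
            out.append("    @%agile_p trap;")
            out.append(f"    setp.ge.u64 %agile_p, {addr}, {hi};")
            out.append("    @%agile_p trap;")
        out.append(line)
        if line.strip() == "{":
            # Declare the scratch registers at the top of the kernel body.
            out.append("    .reg .u64 %agile_a;")
            out.append("    .reg .pred %agile_p;")
    return "\n".join(out)


example = """{
    ld.global.f32 %f1, [%rd4];
    st.global.f32 [%rd6+8], %f1;
}"""
print(inject_guards(example, 4096, 8192))
```

The sketch is deliberately incomplete. It checks only the starting address and not the access width, it guards predicated instructions unconditionally, and it ignores generic-address `ld`/`st`, atomics, and other memory instructions. A guard meant to enforce separation would have to cover those cases.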

What would settle it

An application that successfully accesses protected module state or MMIO regions without mediation by the trusted worker would disprove the isolation claim.

Figures

Figures reproduced from arXiv: 2606.06697 by Alex Jones, Peipei Zhou, Yiyu Shi, Zhuoping Yang.

Figure 1: AgileOS system overview. Untrusted CUDA applications use the …
Figure 2: PTX-level kernel memory guard; (a) a CUDA kernel in the user program; (b) received PTX code from the compiled user program in AgileOS; …
Original abstract

Modern GPU applications increasingly interact with storage systems, network devices, vendor libraries, and GPU-resident services rather than executing only isolated compute kernels. This shift creates a need for operating-system-like protection around GPU services, where service metadata, device queues, memory-mapped I/O regions, and library-internal state should not be directly exposed to untrusted application kernels. However, today's CUDA programming model, by default, still gives each application direct ownership of its CUDA context, device pointers, runtime handles, module loading path, and kernel launches, leaving protected GPU services to build their own ad hoc interfaces and isolation mechanisms. This paper presents the initial design and prototype scope of AgileOS, a GPU operating-system layer for protected CUDA services. AgileOS virtualizes CUDA at the library boundary: applications link against client-side CUDA Runtime, Driver, and selected library shims, while a trusted runtime worker owns the real CUDA context and mediates supported operations. To protect service state and module interfaces, AgileOS also defines a GPU memory-management model that separates user allocations from protected module/MMIO ranges, using pointer validation and memory access guards via PTX injection. AgileOS is modularized and flexible, supporting a range of protected services and existing libraries such as cuFFT and PyTorch. The prototype includes client-side interceptors, worker-side CUDA handlers, virtualized CUDA object tables, protected AgileOS modules, a GPU memory manager that separates user allocations from protected module/MMIO ranges, selected trusted library adapters, and the PTX-level kernel memory guard.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents the initial design and prototype scope of AgileOS, a GPU operating-system layer for protected CUDA services. It virtualizes CUDA at the library boundary by having applications link against client-side CUDA Runtime, Driver, and library shims, while a trusted runtime worker owns the real CUDA context and mediates operations. Protection of service state uses a GPU memory-management model separating user allocations from protected module/MMIO ranges via pointer validation and PTX injection guards. The prototype includes client-side interceptors, worker-side handlers, virtualized object tables, protected modules, a memory manager, trusted library adapters, and the PTX-level kernel memory guard. It claims support for libraries such as cuFFT and PyTorch.

Significance. If the protection claims hold, AgileOS could address a real need for OS-like isolation around GPU services in modern applications interacting with storage, networks, and vendor libraries. However, the manuscript supplies no evaluation data, security analysis, performance numbers, or bypass testing, so the significance cannot be assessed beyond the architectural proposal itself.

major comments (2)
  1. [Prototype description] Prototype description (abstract and full design section): The central protection claim rests on the assumption that linking against client-side shims prevents direct CUDA context ownership and that PTX injection reliably separates user allocations from protected ranges. No coverage analysis of CUDA entry points, direct driver handles, inline PTX, or module loading paths is provided, nor any attack surface evaluation.
  2. [Memory management model] Memory management model (abstract): The PTX-level memory access guards are presented as enforcing separation, but the manuscript contains no formal argument, test cases, or compatibility analysis showing that guards catch all loads/stores to protected MMIO/module ranges without false negatives or breaking existing libraries.
minor comments (1)
  1. The abstract and prototype list could benefit from explicit section numbering or a diagram of the client/worker boundary to improve readability of the modular design.
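
One way to start on the coverage question in major comment 1 is to compare the CUDA symbols a binary imports against the set of entry points the shims mediate. The Python sketch below assumes both lists are already available, for example from a symbol dump. The contents of `SHIMMED` are hypothetical and do not describe the actual prototype; `audit_imports` is likewise an invented helper. The function names in the sets are real CUDA Runtime and Driver API entry points, plus `dlsym`.

```python
# Hypothetical set of mediated entry points; the prototype's real list
# is not given in the text above.
SHIMMED = frozenset({
    "cudaMalloc", "cudaFree", "cudaMemcpy",
    "cudaLaunchKernel", "cuModuleLoadData",
})

# Calls that return function pointers at run time, so a static import
# list alone cannot show what the program may end up calling.
DYNAMIC_LOOKUPS = frozenset({
    "cuGetProcAddress", "cudaGetDriverEntryPoint", "dlsym",
})


def audit_imports(imported, shimmed=SHIMMED):
    """Split imported symbols into unmediated calls and dynamic lookups."""
    imported = set(imported)
    return {
        "unmediated": sorted(imported - shimmed - DYNAMIC_LOOKUPS),
        "dynamic_lookup": sorted(imported & DYNAMIC_LOOKUPS),
    }


report = audit_imports(
    ["cudaMalloc", "cuMemAlloc", "cuGetProcAddress", "cudaLaunchKernel"]
)
print(report)
```

A static check of this kind can only bound the problem from one side: an empty `unmediated` list says nothing about inline PTX or symbols resolved through the dynamic-lookup calls, which is why those are reported separately.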

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments on the scope of our initial design and prototype. The manuscript focuses on the architecture and implementation of the virtualization layer rather than a complete security evaluation. We respond to each major comment below.

  1. Referee: [Prototype description] Prototype description (abstract and full design section): The central protection claim rests on the assumption that linking against client-side shims prevents direct CUDA context ownership and that PTX injection reliably separates user allocations from protected ranges. No coverage analysis of CUDA entry points, direct driver handles, inline PTX, or module loading paths is provided, nor any attack surface evaluation.

    Authors: We agree that the manuscript lacks a coverage analysis of CUDA entry points, direct driver handles, inline PTX, module loading paths, and attack surface evaluation. This is because the work presents an initial design and prototype scope; a full security analysis and bypass testing are outside the current contribution. We will revise the manuscript to explicitly discuss the assumptions, the limited set of supported entry points in the prototype, and the planned directions for comprehensive analysis.
    Revision: partial

  2. Referee: [Memory management model] Memory management model (abstract): The PTX-level memory access guards are presented as enforcing separation, but the manuscript contains no formal argument, test cases, or compatibility analysis showing that guards catch all loads/stores to protected MMIO/module ranges without false negatives or breaking existing libraries.

    Authors: We acknowledge that the manuscript provides no formal argument, test cases, or compatibility analysis for the PTX guards. The current prototype implements the guards as part of the memory manager, but without the requested verification. We will add a revised section describing the guard insertion mechanism, its intended coverage for loads/stores, and any observed compatibility with the supported libraries (cuFFT, PyTorch), while noting the absence of exhaustive testing.
    Revision: partial

Circularity Check

0 steps flagged

No circularity: architectural design proposal without derivations or fitted predictions


The paper presents an architectural proposal for AgileOS, describing a virtualization approach at the CUDA library boundary with PTX injection for memory guards. No equations, parameter fittings, predictions, or derivation chains are present in the abstract or described content. The work relies on design choices and prototype implementation rather than any self-referential mathematical reductions or self-citation load-bearing claims. This is a standard non-finding for systems papers that do not claim first-principles derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that current CUDA exposes service state directly and that mediation plus PTX guards can enforce isolation. No free parameters or new physical entities are introduced.

axioms (1)
  • Domain assumption: Today's CUDA programming model, by default, gives each application direct ownership of its CUDA context, device pointers, runtime handles, module loading path, and kernel launches.
    Stated explicitly in the opening paragraph as the motivation for the work.

pith-pipeline@v0.9.1-grok · 5809 in / 1264 out tokens · 31314 ms · 2026-06-28T00:14:37.324001+00:00 · methodology

