arxiv: 2604.26889 · v1 · submitted 2026-04-29 · 💻 cs.PF

Recognition: unknown

Revealing NVIDIA Closed-Source Driver Command Streams for CPU-GPU Runtime Behavior Insight

Authors on Pith no claims yet

Pith reviewed 2026-05-07 11:52 UTC · model grok-4.3

classification 💻 cs.PF

keywords NVIDIA GPUclosed-source drivercommand streamsCUDAhardware watchpointperformance analysiskernel driverDMA submission

0 comments

The pith

A hardware watchpoint on the GPU doorbell register recovers the complete command streams from NVIDIA's closed-source userspace driver.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a method to recover the hardware command streams that NVIDIA's closed-source driver sends to the GPU during CUDA execution. It does this by using the open-sourced kernel driver to instrument memory mappings and installing a hardware watchpoint on the doorbell register's userspace mapping. This captures commands with full integrity at the exact time of submission. If correct, this visibility would allow analysis of low-level behaviors like data transfer modes and graph launches independently of driver internals.

Core claim

We recover the hardware command streams emitted by NVIDIA's closed-source userspace driver with full integrity by leveraging the recently open-sourced kernel driver, instrumenting the memory-mapping path, and installing a hardware watchpoint on the userspace mapping of the GPU doorbell register. This lets us capture complete command submissions at the moment they are committed. Using this methodology, we present two case studies. For CUDA data movement, we identify the DMA submission modes selected by the driver and characterize their raw hardware performance independently of driver overhead through CUDA-bypassing controlled command issuance. For CUDA Graphs, we show that the reduced launch

What carries the argument

Instrumentation of the memory-mapping path in the open kernel driver together with a hardware watchpoint on the doorbell register to intercept command streams at submission.

If this is right

DMA submission modes for CUDA data movement can be identified and their raw hardware performance characterized without driver overhead.
Reduced launch overhead in newer CUDA releases for graphs is associated with smaller command footprints and more efficient submission patterns.
Command-level visibility provides a practical basis for understanding GPU middleware behavior and improving performance interpretation.
The approach can inform hardware-software co-design for CUDA and related accelerator stacks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same instrumentation pattern could be tested on other partially open accelerator drivers to recover their command streams.
Observed command sequences could be replayed in simulators to predict performance for new hardware designs.
This visibility offers a concrete reference for evaluating whether open-source command generators match proprietary efficiency.

Load-bearing premise

The added instrumentation and hardware watchpoint do not change the timing, ordering, or content of command submissions from the closed-source driver.

What would settle it

A direct comparison that shows the captured streams miss submissions, differ in content, or alter timing compared to the driver's actual behavior, verified by an independent capture method.

Figures

Figures reproduced from arXiv: 2604.26889 by Ian Karlin, Ryan Grant, Yuang Yan.

**Figure 1.** Figure 1: CUDA software stack with layer responsibilities view at source ↗

**Figure 2.** Figure 2: Driver command submission path: pushbuffer view at source ↗

**Figure 3.** Figure 3: Synchronization of different copies of GPFIFO val view at source ↗

**Figure 5.** Figure 5: DMA submission paths of CUDA data movement view at source ↗

**Figure 6.** Figure 6: DMA transfer performance across two copy view at source ↗

**Figure 7.** Figure 7: Scaling of CPU submission cost and command view at source ↗

**Figure 8.** Figure 8: Submission patterns of CUDA 11.8 (top) and CUDA view at source ↗

**Figure 9.** Figure 9: Relationship between pushbuffer command size view at source ↗

**Figure 10.** Figure 10: Command size and CPU launch time correlation view at source ↗

read the original abstract

For NVIDIA GPUs, CUDA is the primary interface through which applications orchestrate GPU execution, yet much of the logic that realizes CUDA operations resides in NVIDIA's closed-source userspace driver. As a result, the translation from high-level CUDA APIs to low-level hardware commands remains opaque, limiting both software understanding and performance attribution. This paper makes that command path visible. We recover the hardware command streams emitted by NVIDIA's closed-source userspace driver with full integrity by leveraging the recently open-sourced kernel driver, instrumenting the memory-mapping path, and installing a hardware watchpoint on the userspace mapping of the GPU doorbell register. This lets us capture complete command submissions at the moment they are committed. Using this methodology, we present two case studies. For CUDA data movement, we identify the DMA submission modes selected by the driver and characterize their raw hardware performance independently of driver overhead through CUDA-bypassing controlled command issuance. For CUDA Graphs, we show that the reduced launch overhead in newer CUDA releases is associated with a smaller command footprint and a more efficient submission pattern. Together, these results show that command-level visibility provides a practical basis for understanding and optimizing GPU middleware behavior, improving performance interpretation, and informing future hardware--software co-design for CUDA and related accelerator stacks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a concrete way to capture full NVIDIA closed driver command streams via kernel instrumentation and doorbell watchpoints, with useful case studies, but the non-intrusiveness claim is the weakest part and needs direct validation.

read the letter

The main takeaway is that this work shows how to pull complete, timing-accurate hardware command streams out of NVIDIA's closed userspace driver by instrumenting the open kernel driver's memory-mapping path and setting a hardware watchpoint on the doorbell register. That combination is new compared to earlier reverse-engineering papers and lets them actually see what the driver commits at submission time. The two case studies then put it to use: one breaks down DMA modes for data movement and measures raw hardware costs by bypassing the driver, and the other links smaller command footprints in newer CUDA releases to the lower launch overhead in CUDA Graphs. Those examples are practical and give performance engineers something they can check against their own runs. The approach is grounded in real hardware features rather than fitted models, which is a plus. The soft spot is exactly the one the stress test flags. Adding the watchpoint and mapping instrumentation could introduce latency or ordering changes that the closed driver would not have produced on its own, and because the driver is closed there is no independent baseline to compare against. The abstract does not describe any validation runs that would rule this out, so the 'full integrity' guarantee stays unproven for now. This paper is aimed at systems and performance people in HPC and AI who already work with NVIDIA GPUs and want to attribute costs or tune middleware. A reader who needs to understand command submission patterns will get concrete value from the case studies even if the capture method needs more scrutiny. It is worth sending to peer review because the technique is falsifiable and the results are actionable, but referees should press on the validation experiments and any measured overhead from the instrumentation itself.

Referee Report

1 major / 1 minor

Summary. The paper presents a method to recover the hardware command streams emitted by NVIDIA's closed-source userspace driver with claimed full integrity. It leverages the recently open-sourced kernel driver by instrumenting the memory-mapping path and installing a hardware watchpoint on the userspace mapping of the GPU doorbell register to capture complete command submissions at commit time. Two case studies are provided: one characterizing DMA submission modes and raw hardware performance for CUDA data movement via CUDA-bypassing command issuance, and another showing that reduced launch overhead in newer CUDA releases correlates with smaller command footprints and more efficient submission patterns in CUDA Graphs.

Significance. If the capture method is non-intrusive and the streams are unmodified, the work offers a practical foundation for low-level visibility into proprietary GPU middleware, enabling improved performance interpretation, optimization of CUDA and related stacks, and informed hardware-software co-design. The case studies illustrate concrete benefits in attributing data-movement costs and understanding launch-efficiency changes across CUDA versions.

major comments (1)

[capture methodology (abstract and §3)] The central claim of recovering streams 'with full integrity' rests on the instrumentation of the memory-mapping path and the hardware watchpoint being transparent to the closed-source driver. No direct comparison to an unmodified submission path, no timing measurements of added latency, and no argument ruling out reordering or content changes are provided, leaving the weakest assumption unverified despite the closed-source nature preventing independent checks.

minor comments (1)

[abstract] The abstract and introduction could more explicitly state the scope limitations imposed by the closed-source driver (e.g., inability to compare against ground truth).

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and for identifying the need for stronger verification of the capture method's transparency. We address the major comment on the capture methodology below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [capture methodology (abstract and §3)] The central claim of recovering streams 'with full integrity' rests on the instrumentation of the memory-mapping path and the hardware watchpoint being transparent to the closed-source driver. No direct comparison to an unmodified submission path, no timing measurements of added latency, and no argument ruling out reordering or content changes are provided, leaving the weakest assumption unverified despite the closed-source nature preventing independent checks.

Authors: We agree that the manuscript would benefit from a more explicit defense of the integrity claim. The kernel instrumentation augments only the mmap path in the recently open-sourced kernel driver to record the virtual address of the doorbell mapping; it performs no modification to the mapping itself, no interception of userspace driver calls, and no alteration of command buffers. The hardware watchpoint is installed solely to trigger on the write to the doorbell register, at which point the command buffer is captured. This occurs after the closed-source userspace driver has completed all command preparation and before the GPU begins execution, so the captured stream matches the driver's emitted commands. Reordering or content changes are precluded because our mechanism does not participate in the write or the preparation phase. We acknowledge the absence of side-by-side unmodified comparisons and explicit latency numbers. In revision we will expand §3 with (i) a dedicated paragraph formalizing why the design rules out interference, (ii) available overhead measurements from our testbed, and (iii) discussion of why a fully unmodified comparison is not feasible without driver source access. These changes will directly address the verification concern. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical instrumentation method with external dependencies

full rationale

The paper describes a systems instrumentation technique that recovers command streams by leveraging the open-sourced NVIDIA kernel driver, memory-mapping instrumentation, and a hardware watchpoint on the doorbell register. No equations, fitted parameters, or first-principles derivations appear in the provided text. The central claim of 'full integrity' recovery rests on an external assumption about non-interference from added instrumentation rather than any self-definition, self-citation chain, or renaming of known results. The two case studies (DMA modes and CUDA Graphs) are observational characterizations enabled by the method, not predictions that reduce to the inputs by construction. This is a standard non-circular empirical systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work depends on the assumption that the open-sourced kernel driver faithfully exposes the memory-mapping behavior used by the closed-source userspace driver and that hardware watchpoints can be installed without side effects.

axioms (1)

domain assumption The recently open-sourced kernel driver accurately exposes the memory-mapping path used by the closed-source userspace driver.
This assumption is required for the instrumentation step to intercept the correct mappings.

pith-pipeline@v0.9.0 · 5526 in / 1197 out tokens · 35840 ms · 2026-05-07T11:52:02.771410+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

19 extracted references · 7 canonical work pages · 1 internal anchor

[1]

Tyler Allen, Bennett Cooper, and Rong Ge. 2024. Fine-grain Quantitative Analysis of Demand Paging in Unified Virtual Memory.ACM Trans. Archit. Code Optim. 21, 1, Article 14 (Jan. 2024), 24 pages. doi:10.1145/3632953

work page doi:10.1145/3632953 2024
[2]

Rob Armstrong, Kevin Mittman, and Fred Oh. 2024. NVIDIA Transi- tions Fully Towards Open-Source GPU Kernel Modules. NVIDIA Technical Blog. https://developer.nvidia.com/blog/nvidia-transitions-fully-towards-open- source-gpu-kernel-modules/

2024
[3]

Anderson

Joshua Bakita and James H. Anderson. 2024. Demystifying NVIDIA GPU Internals to Enable Reliable GPU Management. In2024 IEEE 30th Real-Time and Embedded Technology and Applications Symposium (RTAS). 294–305. doi:10.1109/RTAS61025. 2024.00031

work page doi:10.1109/rtas61025 2024
[4]

Bowen, Andrew Case, Ibrahim Baggili, and Golden G

Christopher J. Bowen, Andrew Case, Ibrahim Baggili, and Golden G. Richard
[5]

Forensic Science International: Digital Investigation49 (2024), 301760

A step in a new direction: NVIDIA GPU kernel driver memory forensics. Forensic Science International: Digital Investigation49 (2024), 301760. doi:10.1016/ j.fsidi.2024.301760 DFRWS USA 2024 - Selected Papers from the 24th Annual Digital Forensics Research Conference USA

work page arXiv 2024
[6]

Shulin Fan, Zhichao Hua, Yubin Xia, and Haibo Chen. 2025. XpuTEE: A High- Performance and Practical Heterogeneous Trusted Execution Environment for GPUs.ACM Trans. Comput. Syst.43, 1–2, Article 2 (April 2025), 27 pages. doi:10. 1145/3719653

2025
[7]

Zhongshu Gu, Enriquillo Valdez, Salman Ahmed, Julian James Stephen, Michael Le, Hani Jamjoom, Shixuan Zhao, and Zhiqiang Lin. 2025. NVIDIA GPU Confi- dential Computing Demystified. arXiv:2507.02770 [cs.CR] https://arxiv.org/abs/ 2507.02770

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Houston Hoffman and Fred Oh. 2024. Constant Time Launch for Straight- Line CUDA Graphs and Other Performance Enhancements. NVIDIA Technical Blog. https://developer.nvidia.com/blog/constant-time-launch-for-straight-line- cuda-graphs-and-other-performance-enhancements

2024
[9]

John Hubbard, Gonzalo Brito, Chirayu Garg, Nikolay Sakharnykh, and Fred Oh. 2023. Simplifying GPU Application Development with Heterogeneous Memory Management. NVIDIA Technical Blog. https://developer.nvidia.com/blog/simplifying-gpu-application-development- with-heterogeneous-memory-management/

2023
[10]

Ao Li, Bojian Zheng, Gennady Pekhimenko, and Fan Long. 2022. Automatic hor- izontal fusion for GPU kernels. InProceedings of the 20th IEEE/ACM International Symposium on Code Generation and Optimization (CGO ’22). IEEE Press, 14–27. doi:10.1109/CGO53902.2022.9741270

work page doi:10.1109/cgo53902.2022.9741270 2022
[11]

2024.User-mode Work Submission

Microsoft. 2024.User-mode Work Submission. https://learn.microsoft.com/en- us/windows-hardware/drivers/display/user-mode-work-submission

2024
[12]

NVIDIA. [n. d.].open-gpu-doc: Documentation of NVIDIA chip/hardware interfaces. https://github.com/NVIDIA/open-gpu-doc
[13]

NVIDIA. 2025. Open GPU Kernel Modules. GitHub repository. Version 580.105.08

2025
[14]

2026.CUDA C++ Programming Guide: Asynchronous Execution

NVIDIA Corporation. 2026.CUDA C++ Programming Guide: Asynchronous Execution. https://docs.nvidia.com/cuda/cuda-programming-guide/02-basics/ asynchronous-execution.html

2026
[15]

2026.NVIDIA Nsight Systems User Guide

NVIDIA Corporation. 2026.NVIDIA Nsight Systems User Guide. https://docs. nvidia.com/nsight-systems/UserGuide/index.html

2026
[16]

2026.Heterogeneous Memory Management (HMM)

The Linux Kernel community. 2026.Heterogeneous Memory Management (HMM). https://www.kernel.org/doc/html/latest/mm/hmm.html

2026
[17]

Shulai Zhang, Ao Xu, Quan Chen, Han Zhao, Weihao Cui, Zhen Wang, Yan Li, Limin Xiao, and Minyi Guo. 2025. Efficient performance-aware GPU sharing with compatibility and isolation through kernel space interception. InProceedings of the 2025 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC ’25). USENIX Association, USA, Article 59, 17 pages

2025
[18]

Zhenkai Zhang, Tyler Allen, Fan Yao, Xing Gao, and Rong Ge. 2023. TunneLs for Bootlegging: Fully Reverse-Engineering GPU TLBs for Challenging Isolation Guarantees of NVIDIA MIG. InProceedings of the 2023 ACM SIGSAC Conference Revealing NVIDIA Closed-Source Driver Command Streams for CPU–GPU Runtime Behavior Insight xxx, 2026, xxx on Computer and Communica...

work page doi:10.1145/3576915.3616672 2023
[19]

Shixuan Zhao, Zhongshu Gu, Salman Ahmed, Enriquillo Valdez, Hani Jamjoom, and Zhiqiang Lin. 2025. GPU Travelling: Efficient Confidential Collaborative Training with TEE-Enabled GPUs. InProceedings of the 2025 ACM SIGSAC Confer- ence on Computer and Communications Security (CCS ’25). Association for Com- puting Machinery, New York, NY, USA, 2653–2667. doi:...

work page doi:10.1145/3719027.3765029 2025