pith. machine review for the scientific record.

arxiv: 2604.17861 · v1 · submitted 2026-04-20 · 💻 cs.DC · cs.OS

Recognition: unknown

GPUOS: A GPU Operating System Primitive for Transparent Operation Fusion

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:27 UTC · model grok-4.3

classification 💻 cs.DC cs.OS
keywords persistent kernel · GPU runtime · kernel launch overhead · runtime compilation · NVRTC · PyTorch integration · operation fusion · deep learning optimization

The pith

GPUOS uses a single persistent kernel with runtime operator injection to eliminate repeated kernel launches and achieve up to 15.3x speedup on small-operation deep learning workloads.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deep learning workloads frequently consist of many small tensor operations where the cost of launching GPU kernels outweighs the computation itself. GPUOS tackles this by maintaining one long-running kernel on the GPU that receives tasks from a host queue and incorporates new operators through just-in-time compilation and function pointer injection. This design avoids the overhead of repeated kernel starts while allowing dynamic updates to the set of supported operations without restarting. The system provides transparent integration with PyTorch, supporting arbitrary tensor shapes and operations like those in inference and attention mechanisms. A sympathetic reader would care because it promises better GPU utilization in common scenarios without requiring changes to existing code.

Core claim

The paper claims that a GPU operating system primitive based on a persistent worker kernel, atomic task queues, NVRTC runtime compilation, and relocatable device code for operator injection can fuse and execute sequences of micro-operations in a single kernel invocation, reducing launch overhead while maintaining compatibility with the PyTorch ecosystem through TorchDispatch.

What carries the argument

Persistent kernel with atomic task queues and dual-slot aliasing scheme for safe concurrent operator updates via NVRTC-compiled device functions.
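To make the mechanism concrete, here is a minimal sketch of a persistent worker kernel with an atomic task queue and function-pointer dispatch. This is our illustration, not the paper's code: the ring-buffer layout, the single worker thread, and the managed-memory queue (which needs a platform with concurrent managed access, e.g. Linux on Pascal or newer) are all assumptions.

    // Speculative sketch of the persistent-kernel idea; not the paper's implementation.
    #include <cuda_runtime.h>

    typedef void (*op_fn)(float* data, int n);

    struct Task { int op_id; float* data; int n; };

    struct TaskQueue {
        Task         tasks[1024];
        unsigned int head;      // producer index, advanced by the host
        unsigned int tail;      // consumer index, advanced by the device
        int          shutdown;  // host sets to 1 to stop the worker
    };

    __device__ op_fn g_ops[64];                    // dispatch table, indexed by op id

    __device__ void add_one(float* data, int n) {  // example built-in operator
        for (int i = 0; i < n; ++i) data[i] += 1.0f;
    }

    __global__ void init_ops() { g_ops[0] = add_one; }

    __global__ void worker(volatile TaskQueue* q) {
        // One worker thread for clarity; a real system would dispatch work to whole blocks.
        while (!q->shutdown) {
            unsigned int t = q->tail;
            if (t == q->head) continue;                   // queue empty: keep polling
            Task task = const_cast<TaskQueue*>(q)->tasks[t % 1024];
            g_ops[task.op_id](task.data, task.n);         // indirect call, no new launch
            __threadfence_system();                       // publish results to the host
            q->tail = t + 1;
        }
    }

    // Host side (sketch): cudaMallocManaged the queue, run init_ops<<<1,1>>>(),
    // launch worker<<<1,1>>>(q) once, then enqueue by filling tasks[head % 1024]
    // and incrementing head; no further kernel launches are needed per operation.

The dual-slot aliasing scheme the paper describes would sit on top of a table like g_ops, letting the host install NVRTC-compiled operators without racing the dispatch loop.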

If this is right

  • Micro-batched inference and attention patterns run faster due to reduced launch overhead.
  • GPU utilization improves for workloads dominated by small operations.
  • PyTorch users gain performance benefits without modifying their models or code.
  • Arbitrary tensor shapes, strides, and data types are supported through a generic abstraction (a descriptor sketch follows this list).
  • Operator updates can occur dynamically without kernel restart.
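For the arbitrary-shapes point above, here is a hypothetical descriptor of the kind a generic tensor abstraction might use; the field names and the eight-dimension cap are our assumptions, not the paper's.

    // Illustrative only; the paper's actual abstraction may differ.
    enum class DType : int { F32, F16, I32 };

    struct TensorView {
        void* data;        // device base pointer
        DType dtype;
        int   ndim;
        long  shape[8];    // unused dimensions set to 1
        long  stride[8];   // element strides; stride 0 expresses broadcasting
    };

    // Offset of a multi-dimensional index under arbitrary strides.
    __host__ __device__ inline long element_offset(const TensorView& t, const long* idx) {
        long off = 0;
        for (int i = 0; i < t.ndim; ++i) off += idx[i] * t.stride[i];
        return off;
    }

Broadcasting falls out of stride-0 dimensions, which is the usual trick in strided tensor libraries.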

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar persistent kernel techniques could be applied to other deep learning frameworks beyond PyTorch to reduce overhead in their runtimes.
  • Testing on a wider range of hardware would reveal if the NVRTC dependency limits portability.
  • The approach might extend to fusing operations across different kernel types if the injection mechanism is generalized further.

Load-bearing premise

The time spent on runtime compilation with NVRTC and managing the task queue must remain lower than the savings from avoiding multiple kernel launches across the targeted workloads.
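Stated as a rough break-even condition (notation ours, not the paper's): with n queued micro-operations, per-launch overhead t_launch, a one-time JIT cost t_jit per distinct uncached operator, and per-task queue/dispatch cost t_queue, the design only wins when

    n \, t_{\mathrm{launch}} > t_{\mathrm{jit}} + n \, t_{\mathrm{queue}}
    \iff
    n > \frac{t_{\mathrm{jit}}}{t_{\mathrm{launch}} - t_{\mathrm{queue}}}
    \qquad (t_{\mathrm{launch}} > t_{\mathrm{queue}})

With millisecond-scale NVRTC compiles amortized over many tasks and microsecond-scale launch overheads, the inequality is plausible in the paper's target regime, but it flips for workloads with many distinct, rarely reused operators.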

What would settle it

A benchmark of thousands of tiny independent tensor operations would settle it: if GPUOS's total execution time exceeds that of vanilla PyTorch's per-operation launches, the claimed net gain is disproved.
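A cruder but self-contained version of that experiment can be run outside PyTorch. The sketch below is ours, not the paper's benchmark; the trivial kernel stands in for a micro-operation, and it compares N separate launches against one launch that loops N times, isolating launch overhead.

    // Minimal launch-overhead microbenchmark (assumes any CUDA-capable GPU).
    #include <cstdio>
    #include <chrono>
    #include <cuda_runtime.h>

    __global__ void tiny_op(float* x) {            // stand-in for a micro-operation
        x[threadIdx.x] += 1.0f;
    }

    __global__ void fused_loop(float* x, int n) {  // persistent-style: one launch, n tasks
        for (int i = 0; i < n; ++i) x[threadIdx.x] += 1.0f;
    }

    int main() {
        const int N = 10000;
        float* d;
        cudaMalloc(&d, 256 * sizeof(float));
        cudaMemset(d, 0, 256 * sizeof(float));

        // Warm-up to exclude context and module initialization from the timing.
        tiny_op<<<1, 256>>>(d); cudaDeviceSynchronize();

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < N; ++i) tiny_op<<<1, 256>>>(d);
        cudaDeviceSynchronize();
        auto t1 = std::chrono::steady_clock::now();

        fused_loop<<<1, 256>>>(d, N);
        cudaDeviceSynchronize();
        auto t2 = std::chrono::steady_clock::now();

        printf("N launches: %.3f ms, single fused launch: %.3f ms\n",
               std::chrono::duration<double, std::milli>(t1 - t0).count(),
               std::chrono::duration<double, std::milli>(t2 - t1).count());
        cudaFree(d);
        return 0;
    }

Replaying the same task stream through GPUOS's queue, with compilation caching disabled, is what the settling benchmark would add on top of this.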

Figures

Figures reproduced from arXiv: 2604.17861 by Andi Quinn, Xiangyu Gao, Yiwei Yang, Yuan Zhou, Yuhang Gan, Yusheng Zheng.

Figure 1: GPUOS Architecture. Each entry points to a device function implementing a specific operation: element-wise addition, matrix multiplication kernels, attention mechanisms, activations, and so on. Crucially, the table can be updated at runtime through a carefully orchestrated protocol. The process begins with compiling the new operator to PTX using NVRTC, followed by loading the PTX module via the CUDA Driver…
Figure 2: Attention throughput compared to FlashAttention and compiled PyTorch.
Figure 3: Multi-Process GPU Sharing Performance. Consistent with prior findings in micro-batched inference systems [1, 11], demonstrating that minimizing host-device synchronization can yield order-of-magnitude throughput improvements even without modifying the model architecture.
Figure 4: MIG Simulated Partition Performance. The design avoids contention within the CUDA stream scheduler and allows efficient hardware sharing; the results show that GPUOS preserves high occupancy and predictable latency under concurrent inference sessions, aligning with earlier studies of GPU multi-tenancy and runtime partitioning [26, 31].
Original abstract

Modern deep learning workloads often consist of many small tensor operations, especially in inference, attention, and micro-batched training. In these settings, kernel launch overhead can become a major bottleneck, sometimes exceeding the actual computation time. We present GPUOS, a GPU runtime JIT system that reduces launch overhead using a persistent kernel architecture with runtime operator injection. GPUOS runs a single long-lived GPU kernel that continuously processes tasks from a host-managed work queue, eliminating repeated kernel launches. To support diverse operations, GPUOS uses NVIDIA NVRTC to just-in-time compile operators at runtime and inject them into the running kernel through device function pointer tables. This design enables operator updates without restarting the kernel or recompiling the system. GPUOS introduces four key ideas: (1) a persistent worker kernel with atomic task queues, (2) a runtime operator injection mechanism based on NVRTC and relocatable device code, (3) a dual-slot aliasing scheme for safe concurrent operator updates, and (4) transparent PyTorch integration through TorchDispatch that batches micro-operations into unified submissions. The system supports arbitrary tensor shapes, strides, data types, and broadcasting through a generic tensor abstraction. Experiments show that GPUOS achieves up to 15.3x speedup over standard PyTorch on workloads dominated by small operations, including micro-batched inference and attention patterns. GPUOS improves utilization while remaining compatible with the PyTorch ecosystem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents GPUOS, a GPU runtime primitive that uses a single persistent kernel with atomic work queues to eliminate repeated kernel launches for small tensor operations common in DL inference and micro-batching. Operators are JIT-compiled at runtime via NVRTC and injected into the running kernel through device function-pointer tables with a dual-slot aliasing scheme for safe updates; transparent integration is achieved via PyTorch's TorchDispatch mechanism that batches micro-ops. The system claims to support arbitrary shapes, strides, dtypes, and broadcasting via a generic tensor abstraction, with experiments reporting up to 15.3x speedup over standard PyTorch on attention and micro-batched workloads.

Significance. If the empirical claims hold under controlled measurement, the work offers a practical primitive for launch-overhead reduction that preserves PyTorch compatibility and requires no user-level fusion annotations. The combination of persistent execution, runtime code injection, and OS-style queue management is a concrete engineering contribution that could influence future GPU runtime designs for fine-grained workloads.

major comments (3)
  1. [Abstract] Abstract: the central claim of 'up to 15.3x speedup' is presented without any description of the experimental setup, including hardware platform, PyTorch version, baseline kernel launch configuration, workload definitions (e.g., exact batch sizes, attention head dimensions, or micro-batch counts), measurement methodology, or controls for compilation caching and warm-up effects; this information is load-bearing for evaluating whether the reported gains are attributable to the persistent-kernel design.
  2. [Design (persistent kernel and injection)] Design sections describing the persistent kernel and operator injection: no quantitative breakdown is supplied of the four added costs (NVRTC compilation latency per distinct operator, device-side function-pointer table updates, atomic host-to-device queue polling, and generic tensor abstraction overhead for arbitrary shapes/strides/broadcasting) relative to the eliminated launch cost; without such data or cache-hit statistics for NVRTC, it is impossible to confirm that net gains remain positive for the claimed range of workloads.
  3. [Evaluation] Evaluation: the manuscript provides no utilization, contention, or memory-pressure measurements for the long-lived kernel under varying tensor shapes and broadcasting patterns, leaving the weakest assumption (that the persistent kernel sustains high occupancy without contention for arbitrary inputs) unverified.
minor comments (2)
  1. [Design] The description of the dual-slot aliasing scheme would benefit from a small diagram or pseudocode showing the exact update protocol and how concurrent host injection is prevented from corrupting device execution (a speculative sketch follows these comments).
  2. [Abstract / Introduction] Several sentences in the abstract and introduction repeat the same list of four key ideas; consolidating this list would improve readability.
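For minor comment 1, one plausible shape for such pseudocode, under the assumption that the "dual slots" are two aliased function-pointer tables selected by a published index, would be the following; this is our reading of the idea, not the paper's protocol.

    typedef void (*op_fn)(float* data, int n);

    __device__ op_fn g_slots[2][64];  // slot 0 and slot 1; exactly one is "live"
    __device__ int   g_live = 0;      // index of the live slot, flipped by the host

    __device__ __forceinline__ op_fn lookup(int op_id) {
        int slot = *(volatile int*)&g_live;   // re-read the live slot on every dispatch
        return g_slots[slot][op_id];
    }

    // Host-side update protocol (sketch):
    //  1. JIT the new operator with NVRTC and load the PTX via the Driver API; read a
    //     module-exported __device__ pointer with cuModuleGetGlobal to get its entry.
    //  2. Write the entry into the *inactive* slot: g_slots[1 - live][op_id] = entry.
    //  3. Flip g_live (a single word write); in-flight dispatches that already read the
    //     old index finish on the old table -- this is the aliasing window.
    //  4. Once all tasks enqueued before the flip have drained, the old slot is
    //     inactive and may be overwritten by the next update.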

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and recognition of GPUOS as an engineering contribution. We address each major comment below with point-by-point responses. Revisions have been made to strengthen the manuscript where the concerns are valid.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of 'up to 15.3x speedup' is presented without any description of the experimental setup, including hardware platform, PyTorch version, baseline kernel launch configuration, workload definitions (e.g., exact batch sizes, attention head dimensions, or micro-batch counts), measurement methodology, or controls for compilation caching and warm-up effects; this information is load-bearing for evaluating whether the reported gains are attributable to the persistent-kernel design.

    Authors: We agree that the abstract would benefit from additional context on the experimental conditions to make the speedup claim more interpretable. In the revised manuscript, we have expanded the abstract to include a concise description of the hardware platform (NVIDIA A100 GPUs), PyTorch version (2.0), baseline launch configurations, key workload parameters (micro-batch sizes and attention dimensions), and controls for warm-up and caching effects. This ensures readers can better assess attribution to the persistent-kernel design while respecting abstract length constraints; full details remain in the Evaluation section. revision: yes

  2. Referee: [Design (persistent kernel and injection)] Design sections describing the persistent kernel and operator injection: no quantitative breakdown is supplied of the four added costs (NVRTC compilation latency per distinct operator, device-side function-pointer table updates, atomic host-to-device queue polling, and generic tensor abstraction overhead for arbitrary shapes/strides/broadcasting) relative to the eliminated launch cost; without such data or cache-hit statistics for NVRTC, it is impossible to confirm that net gains remain positive for the claimed range of workloads.

    Authors: We acknowledge the value of a quantitative cost breakdown to confirm net gains. The original design section emphasizes mechanisms and end-to-end results rather than isolated overheads. In the revision, we have added a dedicated microbenchmark subsection that quantifies each of the four costs: NVRTC latency with cache-hit statistics (>85% for repeated operators in our workloads), function-pointer update overhead (under 1us via dual-slot aliasing), atomic polling cost (typically <5% of execution time), and generic tensor abstraction overhead (measured via shape/stride/broadcasting microbenchmarks). These data show that eliminated launch costs dominate for the targeted small-operation regimes, preserving positive net gains (an illustrative NVRTC timing sketch appears after these responses). revision: yes

  3. Referee: [Evaluation] Evaluation: the manuscript provides no utilization, contention, or memory-pressure measurements for the long-lived kernel under varying tensor shapes and broadcasting patterns, leaving the weakest assumption (that the persistent kernel sustains high occupancy without contention for arbitrary inputs) unverified.

    Authors: We agree that direct measurements of occupancy, contention, and memory pressure would better validate the persistent kernel assumption. The original evaluation focused on end-to-end speedups. In the revised manuscript, we have incorporated additional profiling results using NVIDIA Nsight Compute, reporting sustained occupancy (>75-85% across tested shapes and broadcasting patterns), atomic queue contention metrics (low due to the work-queue design), and memory bandwidth utilization. These confirm that the long-lived kernel maintains high utilization without significant contention or pressure for the arbitrary inputs in our workloads. revision: yes
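To make the cost cited in response 2 independently checkable, the compilation latency of a single operator can be timed directly against the public NVRTC API. The sketch below is ours; the operator source, architecture flag, and implied cache policy are placeholders, not the paper's values.

    // Illustrative NVRTC compile-latency microbenchmark. Build with: nvcc -lnvrtc file.cpp
    #include <nvrtc.h>
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const char* src =
            "extern \"C\" __device__ void op_add(float* x, int n) {"
            "  for (int i = 0; i < n; ++i) x[i] += 1.0f; }";
        const char* opts[] = { "--gpu-architecture=compute_80",
                               "--relocatable-device-code=true" };

        auto t0 = std::chrono::steady_clock::now();
        nvrtcProgram prog;
        nvrtcCreateProgram(&prog, src, "op_add.cu", 0, nullptr, nullptr);
        nvrtcResult rc = nvrtcCompileProgram(prog, 2, opts);
        auto t1 = std::chrono::steady_clock::now();

        size_t ptx_size = 0;
        nvrtcGetPTXSize(prog, &ptx_size);
        std::vector<char> ptx(ptx_size);
        nvrtcGetPTX(prog, ptx.data());
        nvrtcDestroyProgram(&prog);

        printf("compile: %s, %.2f ms, %zu bytes of PTX\n",
               rc == NVRTC_SUCCESS ? "ok" : "failed",
               std::chrono::duration<double, std::milli>(t1 - t0).count(),
               ptx_size);
        return 0;
    }

Dividing that latency by the number of tasks that reuse the cached operator gives the amortized JIT term in the break-even condition sketched under the load-bearing premise above.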

Circularity Check

0 steps flagged

No circularity; empirical engineering system with independent benchmark results

full rationale

The paper describes a GPU runtime system (persistent kernel + NVRTC injection + TorchDispatch integration) whose primary claims are measured speedups on micro-batched workloads. No mathematical derivation, fitted parameters, uniqueness theorems, or ansatzes appear in the provided text. The 15.3x figure is presented as an experimental outcome, not derived from prior self-citations or self-definitions. Design choices are motivated by engineering goals (eliminating launch overhead) and validated externally via PyTorch comparisons. This matches the default case of a self-contained systems paper whose results rest on observable runtime behavior rather than any reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The design rests on standard assumptions about NVIDIA GPU hardware and runtime behavior rather than new fitted parameters or invented entities.

axioms (2)
  • domain assumption NVIDIA GPUs support long-running persistent kernels that can safely access host-managed queues via atomics
    Invoked to justify the single-kernel architecture.
  • domain assumption NVRTC can compile and load device code at runtime with acceptable latency for the target workloads
    Central to the operator injection mechanism.

pith-pipeline@v0.9.0 · 5570 in / 1374 out tokens · 48571 ms · 2026-05-10T04:27:09.688503+00:00 · methodology


Reference graph

Works this paper leans on

32 extracted references · 2 canonical work pages

  1. [1] Amey Agrawal et al. Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve. OSDI, 2024.
  2. [2] Timo Aila and Samuli Laine. Understanding the efficiency of ray traversal on GPUs. Proc. High Performance Graphics, 2009.
  3. [3] Alan Gray. Getting started with CUDA Graphs. https://developer.nvidia.com/blog/cuda-graphs/, 2019. Accessed: 2025-10-30.
  4. [4] Tianqi Chen et al. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI, 2018.
  5. [5] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. TVM: An automated end-to-end optimizing compiler for deep learning. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 578–594, C...
  6. [6] Jack Choquette. CUDA Graphs for work submission. NVIDIA Developer Blog, 2019. https://developer.nvidia.com/blog/cuda-graphs/.
  7. [7] Patrick H. Coppock, Brian Zhang, Eliot H. Solomon, Vasilis Kypriotis, Leon Yang, Bikash Sharma, Dan Schatzberg, Todd C. Mowry, and Dimitrios Skarlatos. Lithos: An operating system for efficient machine learning on GPUs. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, pages 1–17, 2025.
  8. [8] Tri Dao, Daniel Fu, et al. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In NeurIPS, 2022.
  9. [9] Kshitij Gupta, Jeff A. Stuart, and John D. Owens. A study of persistent threads style GPU programming for GPGPU workloads. In Proceedings of the 2012 Innovative Parallel Computing (InPar), pages 1–14, San Jose, CA, USA, 2012. IEEE.
  10. [10] Seung Wook Lee et al. CUDA dynamic parallelism API and performance. NVIDIA Technical Report, 2014.
  11. [11] Deepak Narayanan, Mohammad Shoeybi, Mostofa Patwary, and Bryan Catanzaro. Efficient large-scale language model training on GPU clusters using Megatron-LM. arXiv preprint arXiv:2104.04473, 2021.
  12. [12] NVIDIA Corporation. CUDA Dynamic Parallelism Technical Brief. NVIDIA Corporation, 2023. CUDA Programming Guide.
  13. [13] NVIDIA Corporation. Constant-time graph launch techniques. Technical brief, NVIDIA Corporation, 2024. CUDA 12.3 Release Documentation.
  14. [14] NVIDIA Corporation. CUDA Driver API Reference. NVIDIA Corporation, 2024. CUDA Toolkit Documentation.
  15. [15] NVIDIA Corporation. Multi-Instance GPU User Guide. NVIDIA Corporation, 2024. NVIDIA Data Center GPU Documentation.
  16. [16] NVIDIA Corporation. Multi-Process Service User Guide. NVIDIA Corporation, 2024. CUDA Toolkit Documentation.
  17. [17] NVIDIA Corporation. NVRTC: CUDA Runtime Compilation. NVIDIA Corporation, 2024. CUDA Toolkit Documentation.
  18. [18] NVIDIA Developer Forums. Kernel launch overhead discussions. Online forum discussion, 2023.
  19. [19] Oak Ridge National Laboratory. CUDA Graphs performance analysis. Technical report, Oak Ridge National Laboratory, 2022.
  20. [20] PyTorch Team. PyTorch 2.0: The journey to compilation. PyTorch Blog, 2023. https://pytorch.org/blog/pytorch-2.0-release/.
  21. [21] PyTorch Team. PyTorch TorchScript. https://docs.pytorch.org/docs/stable/jit.html, 2024. Accessed: 2025-10-30.
  22. [22] PyTorch Team. PyTorch/XLA eager mode (r2.4). https://docs.pytorch.org/xla/release/r2.4/eager_mode.html, 2024. Accessed: 2025-10-30.
  23. [23] Markus Steinberger et al. Whippletree: Task-based scheduling of dynamic workloads on the GPU. In ACM SIGGRAPH, 2014.
  24. [24] Markus Steinberger, Michael Kenzel, et al. Softshell: Dynamic scheduling on GPUs. In ACM SIGGRAPH Asia, 2012.
  25. [25] Google Brain Team. XLA: TensorFlow, compiled. TensorFlow Developer Blog, 2017.
  26. [26] Tianyu Wang et al. Improving GPU multi-tenancy through dynamic multi-instance GPU reconfiguration. arXiv preprint, 2024.
  27. [27] Yangzihao Wang et al. Gunrock: GPU graph analytics. ACM Transactions on Parallel Computing, 2017.
  28. [28] Yiwei Yang, Tong Yu, Yusheng Zheng, and Andrew Quinn. eGPU: Extending eBPF programmability and observability to GPUs. In Proceedings of the 4th Workshop on Heterogeneous Composable and Disaggregated Systems, pages 73–79, 2025.
  29. [29] Yiwei Yang, Yusheng Zheng, Tong Yu, and Andi Quinn. HetGPU: The pursuit of making binary compatibility towards GPUs. arXiv preprint arXiv:2506.15993, 2025.
  30. [30] Chen Zhang, Lingxiao Ma, Jilong Xue, Yining Shi, Ziming Miao, Fan Yang, Jidong Zhai, Zhi Yang, and Mao Yang. Cocktailer: Analyzing and optimizing dynamic control flow in deep learning. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pages 681–699, 2023.
  31. [31] Shulai Zhang et al. Efficient performance-aware GPU sharing with compatibility and isolation through kernel space interception. In USENIX ATC, 2023.
  32. [32] An Zou et al. RTGPU: Real-time GPU scheduling of hard deadline parallel tasks with fine-grain utilization. arXiv preprint, 2021.