pith. machine review for the scientific record. sign in

arxiv: 2605.01352 · v1 · submitted 2026-05-02 · 💻 cs.OS · cs.AI· cs.DC

Recognition: unknown

VUDA: Breaking CUDA-Vulkan Isolation for Spatial Sharing of Compute and Graphics on the Same GPU

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:58 UTC · model grok-4.3

classification 💻 cs.OS cs.AIcs.DC
keywords CUDAVulkanGPU spatial sharingexecution isolationchannel redirectionpage table graftingembodied AI simulation
0
0 comments X

The pith

VUDA breaks CUDA-Vulkan isolation by redirecting channels and grafting page tables to enable spatial sharing of compute and graphics on one GPU.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that CUDA and Vulkan workloads can execute concurrently on the same GPU instead of alternating in exclusive time slices. Although the two APIs present different abstractions to programmers, their paths meet at a shared channel primitive in the driver, and their virtual address spaces do not overlap. VUDA therefore redirects CUDA streams into Vulkan's scheduling domain and merges the page tables without any address remapping or data copying on the critical path. A thin annotation API lets developers mark co-schedulable streams. In embodied-AI workloads that interleave physics simulation and photorealistic rendering, the approach raises throughput by as much as 85 percent while improving GPU utilization and lowering end-to-end latency.

Core claim

Although CUDA and Vulkan expose different programming abstractions, their execution paths converge to a common channel primitive at the driver and hardware level, and their virtual-address spaces are inherently disjoint. VUDA exploits these facts through channel redirection into Vulkan's scheduling domain and page-table grafting to unify address spaces, thereby allowing spatial parallelism between compute and graphics workloads while eliminating all data copying on the critical path.

What carries the argument

Channel redirection and page-table grafting that unify CUDA and Vulkan execution paths at the driver level.

If this is right

  • Developers can annotate co-schedulable CUDA streams to run concurrently with Vulkan graphics without changing existing kernels.
  • Compute and graphics workloads no longer occupy mutually exclusive time slices, raising overall GPU utilization.
  • End-to-end latency drops for pipelines that alternate simulation and rendering phases.
  • Throughput rises up to 85 percent relative to temporal-sharing baselines on representative embodied-AI tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same redirection-plus-grafting pattern could be applied to other pairs of compute and graphics APIs that share comparable low-level driver primitives.
  • Unified scheduling across APIs might simplify future hardware designs that currently maintain separate scheduling domains.
  • Mixed workloads on mobile or embedded GPUs could see similar utilization gains if the address-space disjointness property holds across those platforms.

Load-bearing premise

CUDA and Vulkan execution paths converge to a common channel primitive at the driver and hardware level with inherently disjoint virtual-address spaces.

What would settle it

Page-table merging that produces address conflicts or channel redirection that still forces exclusive time slices in hardware would show the spatial-sharing mechanism does not work safely without remapping.

Figures

Figures reproduced from arXiv: 2605.01352 by Bin Xu, Haibo Chen, Jinyu Gu, Pengfei Hu, Wenxin Zheng.

Figure 1
Figure 1. Figure 1: SM utilization trace for a ManiSkill data￾generation workload. The simulation phase (green, action sampling+simulation) and the rendering phase (red, render) alternate sequentially, with SM utilization below 10% during simulation and peaking near 70% during rendering. • We identify parallelism opportunities in two foundational embodied-AI scenarios—simulation data generation and RL training—and show that b… view at source ↗
Figure 3
Figure 3. Figure 3: Code modifications for data generation and RL training. Changed lines are marked with + and -. 3 VUDA Programming Interface This section describes how application developers use VUDA. We first elaborate on the parallelism opportunities existing in embodied-AI simulation workloads (Insight 1), then present the VUDA API and show how it is integrated into each rep￾resentative scenario. 3.1 Parallelism Pattern… view at source ↗
Figure 4
Figure 4. Figure 4: Channel redirection. In the default configura￾tion (left), CUDA streams and Vulkan queues are backed by channels in separate TSGs, preventing spatial co-execution. VUDA unplugs a CUDA stream from its native channel and plugs it into a forwarding channel in the Vulkan context (right), so both submission paths share the same TSG. that redirected kernels can resolve CUDA virtual addresses without any data cop… view at source ↗
Figure 5
Figure 5. Figure 5: Page-table grafting. VUDA recursively merges PDEs from CUDA’s page directory into Vulkan’s. Shared sub￾tree pointers (dashed arrows) ensure that PTE-level changes are visible to both contexts without additional propagation. L4 PTEs are omitted. a CUDA kernel dispatched on a Vulkan channel will fault or produce incorrect results. To close this gap, VUDA implements a one-time bootstrap￾ping step per channel:… view at source ↗
Figure 6
Figure 6. Figure 6: Cost of sharing GPU memory between CUDA and Vulkan: page-table grafting vs. the standard export/import path, measured over an increasing number of 2 MB buffers (log–log scale). RM driver modifications. Beyond the UVM module, we make two targeted modifications to the resource manager driver. First, we add a new RM control command that al￾lows user-space to query the virtual address space handle associated w… view at source ↗
Figure 7
Figure 7. Figure 7: Data-generation throughput (k steps/s) across five ManiSkill environments at 640×480 resolution. VUDA enables inter-step pipelining of simulation and rendering, with gains increasing as the batch size grows and the two phases become more balanced. Act+Sim(K-1) Act+Sim(K) Act+Sim(K+1) ⋯ ⋯ Render(K-1) Render(K) Render(K+1) [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: SM utilization trace with VUDA enabled (same workload as [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗
Figure 10
Figure 10. Figure 10: VLA-based RL rollout throughput speedup on RTX 6000 Pro with VUDA enabled across parallel￾environment counts [PITH_FULL_IMAGE:figures/full_fig_p011_10.png] view at source ↗
read the original abstract

GPU-based simulation environments for embodied AI interleave physics simulation (CUDA) and photorealistic rendering (Vulkan) on a single device. We observe that two foundational scenarios -- simulation data generation and RL training -- can be naturally adapted to execute their simulation and rendering phases concurrently, presenting a significant opportunity to improve GPU utilization through spatial multiplexing. However, a fundamental obstacle we term execution isolation prevents this: CUDA and Vulkan create separate GPU contexts whose channels are bound to different scheduling groups, confining compute and graphics to mutually exclusive time slices. Existing spatial-sharing techniques are limited to the CUDA ecosystem, while temporal-sharing approaches underutilize available resources. This paper presents VUDA, a system that breaks execution isolation to enable spatial parallelism between CUDA compute and Vulkan graphics workloads. VUDA is built on two key observations: although CUDA and Vulkan expose different programming abstractions, their execution paths converge to a common channel primitive at the driver and hardware level; meanwhile, their virtual-address spaces are inherently disjoint, making safe page-table merging feasible without remapping. VUDA exposes a thin API for developers to annotate co-schedulable CUDA streams, and realizes spatial sharing through channel redirection into Vulkan's scheduling domain and page-table grafting to unify address spaces, eliminating all data copying on the critical path. Experiments on representative embodied-AI workloads show that VUDA delivers up to 85% higher throughput than temporal-sharing baselines, while improving GPU utilization and reducing end-to-end latency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes VUDA, a system that breaks CUDA-Vulkan execution isolation to enable spatial sharing of compute (CUDA) and graphics (Vulkan) workloads on a single GPU. It relies on two observations: execution paths converge to a common channel primitive at the driver/hardware level, and virtual-address spaces are inherently disjoint, allowing safe page-table grafting without remapping or data copies. VUDA exposes a thin annotation API for co-schedulable CUDA streams and redirects channels into Vulkan's domain. Experiments on embodied-AI workloads (simulation data generation and RL training) report up to 85% higher throughput than temporal-sharing baselines, along with improved GPU utilization and reduced end-to-end latency.

Significance. If the mechanism is correct and the performance results hold under scrutiny, VUDA would offer a practical way to improve GPU utilization in mixed compute-graphics pipelines common to embodied AI, robotics simulation, and RL training. The work identifies a concrete opportunity to leverage low-level driver convergence rather than remaining confined to single-API spatial sharing or inefficient temporal multiplexing.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (mechanism description): The central claim that CUDA and Vulkan virtual-address spaces are 'inherently disjoint' and converge to a 'common channel primitive' is stated as an observation but receives no verification step, address-space audit, collision-handling logic, or fallback for driver/GPU/version-specific overlaps. This assumption is load-bearing for the correctness of page-table grafting; without it, the redirection could produce invalid mappings or silently fall back to serialization.
  2. [§5] §5 (evaluation): The headline result of 'up to 85% higher throughput' is presented without workload descriptions, hardware details, baseline implementations, number of runs, error bars, or ablation data isolating the contribution of spatial sharing. This absence makes the performance claim impossible to assess or reproduce from the manuscript.
minor comments (1)
  1. The term 'execution isolation' is introduced in the abstract without a concise definition or pointer to the relevant driver-level scheduling groups, which would aid readers outside the immediate GPU-systems community.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating the revisions planned for the next version of the manuscript.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (mechanism description): The central claim that CUDA and Vulkan virtual-address spaces are 'inherently disjoint' and converge to a 'common channel primitive' is stated as an observation but receives no verification step, address-space audit, collision-handling logic, or fallback for driver/GPU/version-specific overlaps. This assumption is load-bearing for the correctness of page-table grafting; without it, the redirection could produce invalid mappings or silently fall back to serialization.

    Authors: We appreciate the referee highlighting the need for explicit verification of this load-bearing assumption. The claims are grounded in our examination of driver source code, hardware specifications, and empirical testing across multiple driver versions, which showed convergence to shared channel primitives and disjoint virtual address spaces. To strengthen the manuscript, we will revise §3 to add a dedicated verification subsection. This will describe the address-space audit methodology, document any identified collision risks, and detail the implemented fallback to serialization when overlaps are detected. We believe this addition will address the correctness concerns without altering the core mechanism. revision: yes

  2. Referee: [§5] §5 (evaluation): The headline result of 'up to 85% higher throughput' is presented without workload descriptions, hardware details, baseline implementations, number of runs, error bars, or ablation data isolating the contribution of spatial sharing. This absence makes the performance claim impossible to assess or reproduce from the manuscript.

    Authors: We agree that the current presentation of results in §5 lacks sufficient detail for full assessment and reproducibility. In the revised manuscript we will expand the evaluation section to include complete workload descriptions for the embodied-AI simulation and RL tasks, exact hardware specifications (GPU models, driver versions), references or descriptions of the temporal-sharing baselines, the number of runs performed, error bars or confidence intervals, and ablation experiments that isolate the contributions of channel redirection and page-table grafting. These changes will make the 85% throughput improvement claim verifiable and reproducible. revision: yes

Circularity Check

0 steps flagged

No circularity: system design and empirical results stand independently of self-referential definitions or fitted inputs.

full rationale

The paper is a systems implementation work with no equations, no fitted parameters renamed as predictions, and no derivation chain that reduces to its own inputs. Key premises (convergence to a common channel primitive and inherently disjoint VA spaces) are presented as observations enabling the design, with throughput and utilization claims resting on described mechanisms plus experimental outcomes. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear in the provided text. The contribution is self-contained as an engineering artifact evaluated against baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two hardware/driver assumptions stated as observations; no free parameters or new entities are introduced.

axioms (2)
  • domain assumption CUDA and Vulkan execution paths converge to a common channel primitive at the driver and hardware level
    Explicitly listed as one of the two key observations enabling the design.
  • domain assumption CUDA and Vulkan virtual-address spaces are inherently disjoint
    Stated as the second key observation that makes page-table grafting safe without remapping.

pith-pipeline@v0.9.0 · 5578 in / 1210 out tokens · 43810 ms · 2026-05-10T15:58:22.584019+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

80 extracted references · 10 canonical work pages · 5 internal anchors

  1. [1]

    Introducing helix 02: Full-body autonomy, January 2026

    Figure AI. Introducing helix 02: Full-body autonomy, January 2026. Accessed: 2026-05-01

  2. [2]

    Parallel frame rendering: Trading responsiveness for energy on a mobile gpu

    Jose-Maria Arnau, Joan-Manuel Parcerisa, and Polychronis Xekalakis. Parallel frame rendering: Trading responsiveness for energy on a mobile gpu. InProceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, pages 83–92, 2013

  3. [3]

    𝜋0: A vision-language-action flow model for general robot control, 2026

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xi- aoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. 𝜋0: A vi...

  4. [4]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control, 2023

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonza- lez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Haus- man, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashn...

  5. [5]

    Rt-1: Robotics transformer for real-world control at scale, 2023

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Haus- man, Alex Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla...

  6. [6]

    Genie: Gener- ative interactive environments, 2024

    Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behba- hani, Stephanie Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Satinder ...

  7. [7]

    WorldVLA: Towards Autoregressive Action World Model

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. World- vla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

  8. [8]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong do- main randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

  9. [9]

    Coppock, Brian Zhang, Eliot H

    Patrick H. Coppock, Brian Zhang, Eliot H. Solomon, Vasilis Kypriotis, Leon Yang, Bikash Sharma, Dan Schatzberg, Todd C. Mowry, and Dimitrios Skarlatos. Lithos: An operating system for efficient machine learning on gpus. InProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, SOSP ’25, page 1–17, New York, NY, USA, 2025. Association...

  10. [10]

    Kulkarni, and K

    Aditya Dhakal, Sameer G. Kulkarni, and K. K. Ramakrishnan. GSLICE: controlled spatial sharing of gpus for a scalable inference platform. In Rodrigo Fonseca, Christina Delimitrou, and Beng Chin Ooi, editors, SoCC ’20: ACM Symposium on Cloud Computing, Virtual Event, USA, October 19-21, 2020, pages 492–506. ACM, 2020

  11. [11]

    Boosting GPU virtualization performance with hybrid shadow page tables

    Yaozu Dong, Mochi Xue, Xiao Zheng, Jiajun Wang, Zhengwei Qi, and Haibing Guan. Boosting GPU virtualization performance with hybrid shadow page tables. In Shan Lu and Erik Riedel, editors,Proceedings of the 2015 USENIX Annual Technical Conference, USENIX ATC 2015, July 8-10, Santa Clara, CA, USA, pages 517–528. USENIX Association, 2015

  12. [12]

    Prefillonly: An inference engine for prefill- only workloads in large language model applications

    Kuntai Du, Bowen Wang, Chen Zhang, Yiming Cheng, Qing Lan, Hejian Sang, Yihua Cheng, Jiayi Yao, Xiaoxuan Liu, Yifan Qiao, Ion Stoica, and Junchen Jiang. Prefillonly: An inference engine for prefill- only workloads in large language model applications. In Youjip Won, Youngjin Kwon, Ding Yuan, and Rebecca Isaacs, editors,Proceedings of the ACM SIGOPS 31st S...

  13. [13]

    Bench- marking vulkan vs opengl rendering on low-power edge gpus

    Oscar Ferraz, Paulo Menezes, Vítor Silva, and Gabriel Falcão. Bench- marking vulkan vs opengl rendering on low-power edge gpus. In International Conference on Graphics and Interaction, ICGI 2021, Porto, Portugal, November 4-5, 2021, pages 1–8. IEEE, 2021

  14. [14]

    Helix: A vision-language-action model for humanoid robots

    Figure AI. Helix: A vision-language-action model for humanoid robots. https://www.figure.ai/helix, 2025. Accessed: 2026-03-22

  15. [15]

    Weaver: Efficient multi-llm serving with attention offloading

    Shiwei Gao, Qing Wang, Shaoxun Zeng, Youyou Lu, and Jiwu Shu. Weaver: Efficient multi-llm serving with attention offloading. In Deniz Altinbüken and Ryan Stutsman, editors,Proceedings of the 2025 USENIX Annual Technical Conference, USENIX ATC 2025, Boston, MA, USA, July 7-9, 2025, pages 587–595. USENIX Association, 2025

  16. [16]

    gvulkan: Scalable GPU pooling for pixel- grained rendering in ray tracing

    Yicheng Gu, Yun Wang, Yunfan Sun, Yuxin Xiang, Xuyan Hu, Zheng- wei Qi, and Haibing Guan. gvulkan: Scalable GPU pooling for pixel- grained rendering in ray tracing. In Saurabh Bagchi and Yiying Zhang, editors,Proceedings of the 2024 USENIX Annual Technical Conference, USENIX ATC 2024, Santa Clara, CA, USA, July 10-12, 2024, pages 1151–

  17. [17]

    USENIX Association, 2024

  18. [18]

    Microsecond-scale preemption for concurrent GPU-accelerated DNN inferences

    Mingcong Han, Hanze Zhang, Rong Chen, and Haibo Chen. Microsecond-scale preemption for concurrent GPU-accelerated DNN inferences. In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 539–558, Carlsbad, CA, July 2022. USENIX Association

  19. [19]

    Rossbach, and Emmett Witchel

    Tyler Hunt, Zhipeng Jia, Vance Miller, Ariel Szekely, Yige Hu, Christo- pher J. Rossbach, and Emmett Witchel. Telekine: Secure computing with cloud GPUs. In17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20), pages 817–833, Santa Clara, CA, February 2020. USENIX Association

  20. [20]

    Physical Intelligence, Bo Ai, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Greg Balke, Kevin Black, George Bokinsky, Shihao Cao, Thomas Charbonnier, Vedant Choudhary, Foster Collins, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Maitrayee Dhaka, Jared DiCarlo, Danny Driess, Michael Equi, Adnan Esmail, Yunhao Fang, Chelsea Finn, Catherine...

  21. [21]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch...

  22. [22]

    Wovr: World models as reliable simulators for post-training vla policies with rl.ArXiv, abs/2602.13977, 2026

    Zhennan Jiang, Shangqing Zhou, Yutong Jiang, Zefang Huang, Mingjie Wei, Yuhui Chen, Tianxing Zhou, Zhen Guo, Hao Lin, Quanlu Zhang, Yu Wang, Haoran Li, Chao Yu, and Dongbin Zhao. Wovr: World models as reliable simulators for post-training vla policies with rl. arXiv preprint arXiv:2602.13977, 2026

  23. [23]

    K, Kunal Kumar Sahoo, Amartya Ranjan Saikia, Pranav Gupta, Atharva Vinay Joshi, Priyanshu Pansari, and Yogesh Simmhan

    Prashanthi S. K, Kunal Kumar Sahoo, Amartya Ranjan Saikia, Pranav Gupta, Atharva Vinay Joshi, Priyanshu Pansari, and Yogesh Simmhan. Pagoda: An energy and time roofline study for DNN workloads on edge accelerators.CoRR, abs/2509.20189, 2025

  24. [24]

    Timegraph: GPU scheduling for real-time multi-tasking environments

    Shinpei Kato, Karthik Lakshmanan, Ragunathan Rajkumar, and Yutaka Ishikawa. Timegraph: GPU scheduling for real-time multi-tasking environments. In Jason Nieh and Carl A. Waldspurger, editors,Pro- ceedings of the 2011 USENIX Annual Technical Conference, USENIX ATC 2011, Portland, OR, USA, June 15-17, 2011. USENIX Association, 2011

  25. [25]

    Shinpei Kato, Michael McThrow, Carlos Maltzahn, and Scott A. Brandt. Gdev: First-class GPU resource management in the operating system. In Gernot Heiser and Wilson C. Hsieh, editors,Proceedings of the 2012 USENIX Annual Technical Conference, USENIX ATC 2012, Boston, MA, USA, June 13-15, 2012, pages 401–412. USENIX Association, 2012

  26. [26]

    VK_NV_cuda_kernel_launch extension

    Khronos Group. VK_NV_cuda_kernel_launch extension. https://docs.vulkan.org/refpages/latest/refpages/source/VK_ NV_cuda_kernel_launch.html, 2023. Vulkan API Reference. Accessed: 2026-03-22

  27. [27]

    The vulkan graphics and compute api.https://www

    Khronos Group. The vulkan graphics and compute api.https://www. vulkan.org/, 2026. Accessed: 2026-03-22

  28. [28]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision- language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

  29. [29]

    Openvla: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R San- keti, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. In Pulkit Agrawal, Oliver Kroemer, and...

  30. [30]

    Nikolopoulos, and Cheol-Ho Hong

    Munkyu Lee, Sihoon Seong, Minki Kang, Jihyuk Lee, Gap-Joo Na, In- Geol Chun, Dimitrios S. Nikolopoulos, and Cheol-Ho Hong. Parvagpu: Efficient spatial GPU sharing for large-scale DNN inference in cloud environments. InProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2024, Atlanta, GA, USA, N...

  31. [31]

    MISO: exploiting multi-instance GPU capability on multi- tenant GPU clusters

    Baolin Li, Tirthak Patel, Siddharth Samsi, Vijay Gadepally, and Devesh Tiwari. MISO: exploiting multi-instance GPU capability on multi- tenant GPU clusters. In Ada Gavrilovska, Deniz Altinbüken, and Carsten Binnig, editors,Proceedings of the 13th Symposium on Cloud Computing, SoCC 2022, San Francisco, California, November 7-11, 2022, pages 173–189. ACM, 2022

  32. [32]

    SimpleVLA-RL: Scaling VLA training via reinforcement learning

    Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Yang Zhaohui, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, Dehui Wang, Dingxiang Luo, Yuchen Fan, Youbang Sun, Jia Zeng, Jiangmiao Pang, Shanghang Zhang, Yu Wang, Yao Mu, Bowen Zhou, and Ning Ding. SimpleVLA-RL: Scaling VLA training via reinforcement learning. InThe Fourteenth Internationa...

  33. [33]

    Bullet: Boosting GPU utilization for LLM serving via dynamic spatial-temporal orchestration

    Zejia Lin, Hongxin Xu, Guanyi Chen, Zhiguang Chen, Yutong Lu, and Xianwei Zhang. Bullet: Boosting GPU utilization for LLM serving via dynamic spatial-temporal orchestration. In Benjamin C. Lee, Harry Xu, Mark Silberstein, and Bingyao Li, editors,Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Ope...

  34. [34]

    Lufei Liu, Mohammadreza Saed, Yuan-Hsi Chou, Davit Grigoryan, Tyler Nowicki, and Tor M. Aamodt. Lumibench: A benchmark suite for hardware ray tracing. InIEEE International Symposium on Workload Characterization, IISWC 2023, Ghent, Belgium, October 1-3, 2023, pages 1–14. IEEE, 2023. 13

  35. [35]

    Isaac gym: High performance gpu-based physics simulation for robot learning

    Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur All- shire, Ankur Handa, and Gavriel State. Isaac gym: High performance gpu-based physics simulation for robot learning. InAdvances in Neu- ral Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2021

  36. [36]

    Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning

    Mayank Mittal, Pascal Roth, James Tigue, Antoine Richard, Octi Zhang, Peter Du, Antonio Serrano-Muñoz, Xinjie Yao, René Zurbrügg, Nikita Rudin, Lukasz Wawrzyniak, Milad Rakhsha, Alain Denzler, Eric Hei- den, Ales Borovicka, Ossama Ahmed, Iretiayo Akinola, Abrar Anwar, Mark T. Carlson, Ji Yuan Feng, Animesh Garg, Renato Gasoto, Li- onel Gulich, Yijie Guo, ...

  37. [37]

    Kelvin K. W. Ng, Henri Maxime Demoulin, and Vincent Liu. Paella: Low-latency model serving with software-defined GPU scheduling. In Jason Flinn, Margo I. Seltzer, Peter Druschel, Antoine Kaufmann, and Jonathan Mace, editors,Proceedings of the 29th Symposium on Operating Systems Principles, SOSP 2023, Koblenz, Germany, October 23-26, 2023, pages 595–610. ACM, 2023

  38. [38]

    Cosmos world foundation model platform for physical ai, 2025

    NVIDIA, :, Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Huff- man, Pooya Jannaty, ...

  39. [39]

    GR00T N1: An open foundation model for generalist humanoid robots

    NVIDIA, Johan Bjorck, Nikita Cherniadev Fernando Castañeda, Xingye Da, Runyu Ding, Linxi "Jim" Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, You L...

  40. [40]

    CUDA green contexts.https: //docs.nvidia.com/cuda/cuda-programming-guide/04-special- topics/green-contexts.html, 2024

    NVIDIA Corporation. CUDA green contexts.https: //docs.nvidia.com/cuda/cuda-programming-guide/04-special- topics/green-contexts.html, 2024. CUDA C++ Programming Guide. Accessed: 2026-03-22

  41. [41]

    NVIDIA open GPU kernel modules.https:// github.com/NVIDIA/open-gpu-kernel-modules, 2024

    NVIDIA Corporation. NVIDIA open GPU kernel modules.https:// github.com/NVIDIA/open-gpu-kernel-modules, 2024. Accessed: 2026- 03-22

  42. [42]

    Cuda c++ programming guide.https://docs

    NVIDIA Corporation. Cuda c++ programming guide.https://docs. nvidia.com/cuda/cuda-c-programming-guide/, 2026. Accessed: 2026- 03-22

  43. [43]

    Multi-Instance GPU (MIG).https://docs.nvidia

    NVIDIA Corporation. Multi-Instance GPU (MIG).https://docs.nvidia. com/datacenter/tesla/mig-user-guide/, 2026. Accessed: 2026-03-22

  44. [44]

    Multi-Process Service (MPS).https://docs

    NVIDIA Corporation. Multi-Process Service (MPS).https://docs. nvidia.com/deploy/mps/, 2026. Accessed: 2026-03-22

  45. [45]

    CHOPIN: scalable graphics rendering in multi-gpu systems via parallel image composition

    Xiaowei Ren and Mieszko Lis. CHOPIN: scalable graphics rendering in multi-gpu systems via parallel image composition. InIEEE Interna- tional Symposium on High-Performance Computer Architecture, HPCA 2021, Seoul, South Korea, February 27 - March 3, 2021, pages 709–722. IEEE, 2021

  46. [46]

    Rossbach, Jon Currey, Mark Silberstein, Baishakhi Ray, and Emmett Witchel

    Christopher J. Rossbach, Jon Currey, Mark Silberstein, Baishakhi Ray, and Emmett Witchel. Ptask: operating system abstractions to manage gpus as compute devices. In Ted Wobber and Peter Druschel, editors, Proceedings of the 23rd ACM Symposium on Operating Systems Prin- ciples 2011, SOSP 2011, Cascais, Portugal, October 23-26, 2011, pages 233–248. ACM, 2011

  47. [47]

    Introducing gwm-1, December 2025

    Runway. Introducing gwm-1, December 2025. Accessed: 2026-05-01

  48. [48]

    Habitat: A platform for embodied ai research

    Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A platform for embodied ai research. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9339–9347, 2019

  49. [49]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  50. [50]

    Embodied ai and the rise of humanoid robots.https://www.morganstanley.com/im/publication/insights/ articles/article_humanoid-robots_a4.pdf, 2026

    Morgan Stanley. Embodied ai and the rise of humanoid robots.https://www.morganstanley.com/im/publication/insights/ articles/article_humanoid-robots_a4.pdf, 2026. Accessed: 2026-03-22

  51. [51]

    Yusuke Suzuki, Shinpei Kato, Hiroshi Yamada, and Kenji Kono. Gpuvm: Why not virtualizing gpus at the hypervisor? In Garth Gibson and Nickolai Zeldovich, editors,Proceedings of the 2014 USENIX Annual Technical Conference, USENIX ATC 2014, Philadelphia, PA, USA, June 19-20, 2014, pages 109–120. USENIX Association, 2014

  52. [52]

    Serving DNN models with multi-instance gpus: A case of the reconfigurable machine scheduling problem.CoRR, abs/2109.11067, 2021

    Cheng Tan, Zhichao Li, Jian Zhang, Yu Cao, Sikai Qi, Zherui Liu, Yibo Zhu, and Chuanxiong Guo. Serving DNN models with multi-instance gpus: A case of the reconfigurable machine scheduling problem.CoRR, abs/2109.11067, 2021

  53. [53]

    Maniskill3: Gpu paral- lelized robotics simulation and rendering for generalizable embodied ai.Robotics: Science and Systems, 2025

    Stone Tao, Fanbo Xiang, Arth Shukla, Yuzhe Qin, Xander Hinrichsen, Xiaodi Yuan, Chen Bao, Xinsong Lin, Yulin Liu, Tse kai Chan, Yuan Gao, Xuanlin Li, Tongzhou Mu, Nan Xiao, Arnav Gurha, Viswesh Na- gaswamy Rajesh, Yong Woo Choi, Yen-Ru Chen, Zhiao Huang, Roberto Calandra, Rui Chen, Shan Luo, and Hao Su. Maniskill3: Gpu paral- lelized robotics simulation a...

  54. [54]

    Alex Hofer, Jan Humplik, Atil Iscen, Mithun George Jacob, Deepali Jain, Ryan Julian, Dmitry Kalashnikov, M

    Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean- Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, Steven Bohez, Konstantinos Bousmalis, Anthony Brohan, Thomas Buschmann, Arunkumar Byravan, Serkan Cabi, Ken Caluwaerts, Federico Casarini, Oscar Chang, Jose Enrique...

  55. [55]

    Gen-1: Scaling embodied foundation models to mastery.Generalist AI Blog, 2026

    Generalist AI Team. Gen-1: Scaling embodied foundation models to mastery.Generalist AI Blog, 2026. https://generalistai.com/blog/apr- 02-2026-GEN-1

  56. [56]

    Tesla optimus: Humanoid robot.https://www.teslarati

    Tesla, Inc. Tesla optimus: Humanoid robot.https://www.teslarati. com/tesla-optimus-job-listings/, 2025. Accessed: 2026-03-22

  57. [57]

    A full GPU virtu- alization solution with mediated pass-through

    Kun Tian, Yaozu Dong, and David Cowperthwaite. A full GPU virtu- alization solution with mediated pass-through. In Garth Gibson and Nickolai Zeldovich, editors,Proceedings of the 2014 USENIX Annual Technical Conference, USENIX ATC 2014, Philadelphia, PA, USA, June 19-20, 2014, pages 121–132. USENIX Association, 2014

  58. [58]

    Unitree H1: Full-size humanoid robot.https://www

    Unitree Robotics. Unitree H1: Full-size humanoid robot.https://www. unitree.com/h1, 2025. Accessed: 2026-03-22

  59. [59]

    Characterizing network requirements for gpu api remot- ing in ai applications, 2024

    Tianxia Wang, Zhuofu Chen, Xingda Wei, Jinyu Gu, Rong Chen, and Haibo Chen. Characterizing network requirements for gpu api remot- ing in ai applications, 2024

  60. [60]

    Phoenixos: Concurrent os-level gpu checkpoint and restore with validated speculation

    Xingda Wei, Zhuobin Huang, Tianle Sun, Yingyi Hao, Rong Chen, Mingcong Han, Jinyu Gu, and Haibo Chen. Phoenixos: Concurrent os-level gpu checkpoint and restore with validated speculation. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Sys- tems Principles, SOSP ’25, page 996–1013, New York, NY, USA, 2025. Association for Computing Machinery

  61. [61]

    Phoenixos: Concurrent os-level GPU checkpoint and restore with validated speculation

    Xingda Wei, Zhuobin Huang, Tianle Sun, Yingyi Hao, Rong Chen, Mingcong Han, Jinyu Gu, and Haibo Chen. Phoenixos: Concurrent os-level GPU checkpoint and restore with validated speculation. In Youjip Won, Youngjin Kwon, Ding Yuan, and Rebecca Isaacs, editors, Proceedings of the ACM SIGOPS 31st Symposium on Operating Sys- tems Principles, SOSP 2025, Lotte Ho...

  62. [62]

    Marble: A multimodal world model, November 2025

    World Labs team. Marble: A multimodal world model, November 2025. Accessed: 2026-05-01

  63. [63]

    Trans- parent GPU sharing in container clouds for deep learning workloads

    Bingyang Wu, Zili Zhang, Zhihao Bai, Xuanzhe Liu, and Xin Jin. Trans- parent GPU sharing in container clouds for deep learning workloads. In20th USENIX Symposium on Networked Systems Design and Imple- mentation (NSDI 23), pages 69–85, Boston, MA, April 2023. USENIX Association

  64. [64]

    StreamBox: A lightweight GPU SandBox for serverless inference workflow

    Hao Wu, Yue Yu, Junxiao Deng, Shadi Ibrahim, Song Wu, Hao Fan, Ziyue Cheng, and Hai Jin. StreamBox: A lightweight GPU SandBox for serverless inference workflow. In2024 USENIX Annual Technical Conference (USENIX ATC 24), pages 59–73, Santa Clara, CA, July 2024. USENIX Association

  65. [65]

    OS rendering service made parallel with out-of-order execution and in-order commit

    Yuanpei Wu, Chao Xu, Yubin Xia, Yang Yu, Ming Fu, Binyu Zang, and Haibo Chen. OS rendering service made parallel with out-of-order execution and in-order commit. In Lidong Zhou and Yuanyuan Zhou, editors,19th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2025, Boston, MA, USA, July 7-9, 2025, pages 693–710. USENIX Association, 2025

  66. [66]

    Chang, Leonidas J

    Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, Li Yi, Angel X. Chang, Leonidas J. Guibas, and Hao Su. SAPIEN: A simulated part- based interactive environment. InThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  67. [67]

    Aegaeon: Effective GPU pooling for concurrent LLM serving on the market

    Yuxing Xiang, Xue Li, Kun Qian, Yufan Yang, Diwen Zhu, Wenyuan Yu, Ennan Zhai, Xuanzhe Liu, Xin Jin, and Jingren Zhou. Aegaeon: Effective GPU pooling for concurrent LLM serving on the market. In Youjip Won, Youngjin Kwon, Ding Yuan, and Rebecca Isaacs, ed- itors,Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, SOSP 2025, Lotte...

  68. [68]

    Gandiva: In- trospective cluster scheduling for deep learning

    Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, Fan Yang, and Lidong Zhou. Gandiva: In- trospective cluster scheduling for deep learning. In Andrea C. Arpaci- Dusseau and Geoff Voelker, editors,13th USENIX Symposium on Op- erating Systems Design and Imple...

  69. [69]

    Antman: Dynamic scaling on GPU clusters for deep learning

    Wencong Xiao, Shiru Ren, Yong Li, Yang Zhang, Pengyang Hou, Zhi Li, Yihui Feng, Wei Lin, and Yangqing Jia. Antman: Dynamic scaling on GPU clusters for deep learning. In14th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2020, Virtual Event, November 4-6, 2020, pages 533–548. USENIX Association, 2020

  70. [70]

    igniter: Interference-aware GPU resource provision- ing for predictable DNN inference in the cloud.IEEE Trans

    Fei Xu, Jianian Xu, Jiabin Chen, Li Chen, Ruitao Shang, Zhi Zhou, and Fangming Liu. igniter: Interference-aware GPU resource provision- ing for predictable DNN inference in the cloud.IEEE Trans. Parallel Distributed Syst., 34(3):812–827, 2023

  71. [71]

    gscale: Scaling up GPU virtualization with dynamic sharing of graphics memory space

    Mochi Xue, Kun Tian, Yaozu Dong, Jiacheng Ma, Jiajun Wang, Zheng- wei Qi, Bingsheng He, and Haibing Guan. gscale: Scaling up GPU virtualization with dynamic sharing of graphics memory space. In Ajay Gulati and Hakim Weatherspoon, editors,Proceedings of the 2016 USENIX Annual Technical Conference, USENIX ATC 2016, Denver, CO, USA, June 22-24, 2016, pages 5...

  72. [72]

    Robochallenge: Large-scale real-robot evaluation of embodied policies, 2025

    Adina Yakefu, Bin Xie, Chongyang Xu, Enwen Zhang, Erjin Zhou, Fan Jia, Haitao Yang, Haoqiang Fan, Haowei Zhang, Hongyang Peng, Jing Tan, Junwen Huang, Kai Liu, Kaixin Liu, Kefan Gu, Qinglun Zhang, Ruitao Zhang, Saike Huang, Shen Cheng, Shuaicheng Liu, Tiancai Wang, Tiezhen Wang, Wei Sun, Wenbin Tang, Yajun Wei, Yang Chen, Youqiang Gui, Yucheng Zhao, Yunch...

  73. [73]

    Preba: A hardware/software co-design for multi-instance gpu based ai inference servers, 2024

    Gwangoo Yeo, Jiin Kim, Yujeong Choi, and Minsoo Rhu. Preba: A hardware/software co-design for multi-instance gpu based ai inference servers, 2024

  74. [74]

    Rlinf-vla: A unified and efficient framework for vla+ rl training.arXiv preprint arXiv:2510.06710, 2025

    Hongzhi Zang, Mingjie Wei, Si Xu, Yongji Wu, Zhen Guo, Yuanqing Wang, Hao Lin, Liangzhi Shi, Yuqing Xie, Zhexuan Xu, et al. Rlinf-vla: A unified and efficient framework for vla+ rl training.arXiv preprint arXiv:2510.06710, 2025

  75. [75]

    Matrix-game: Interactive world foundation model, 2025

    Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, and Yahui Zhou. Matrix-game: Interactive world foundation model, 2025

  76. [76]

    SGDRC: software-defined dynamic resource control for concurrent DNN inference on NVIDIA gpus

    Yongkang Zhang, Haoxuan Yu, Chenxia Han, Cheng Wang, Baotong Lu, Yunzhe Li, Zhifeng Jiang, Yang Li, Xiaowen Chu, and Huaicheng Li. SGDRC: software-defined dynamic resource control for concurrent DNN inference on NVIDIA gpus. InProceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel 15 Programming, PPoPP 2025, Las Vegas,...

  77. [77]

    Lau, Yuqi Wang, Yifan Xiong, and Bin Wang

    Hanyu Zhao, Zhenhua Han, Zhi Yang, Quanlu Zhang, Fan Yang, Li- dong Zhou, Mao Yang, Francis C.M. Lau, Yuqi Wang, Yifan Xiong, and Bin Wang. HiveD: Sharing a GPU cluster for deep learning with guarantees. In14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 515–532. USENIX Association, November 2020

  78. [78]

    Tally: Non- intrusive performance isolation for concurrent deep learning work- loads

    Wei Zhao, Anand Jayarajan, and Gennady Pekhimenko. Tally: Non- intrusive performance isolation for concurrent deep learning work- loads. In Lieven Eeckhout, Georgios Smaragdakis, Kaitai Liang, Adrian Sampson, Martha A. Kim, and Christopher J. Rossbach, editors,Pro- ceedings of the 30th ACM International Conference on Architectural Support for Programming ...

  79. [79]

    Muxflow: Efficient and safe gpu sharing in large-scale production deep learning clusters, 2023

    Yihao Zhao, Xin Liu, Shufan Liu, Xiang Li, Yibo Zhu, Gang Huang, Xuanzhe Liu, and Xin Jin. Muxflow: Efficient and safe gpu sharing in large-scale production deep learning clusters, 2023

  80. [80]

    Genesis: A generative and universal physics engine for robotics and beyond.arXiv preprint arXiv:2409.15824, 2024

    Xian Zhou, Theophile Luo, Zhaoting Ren, Hao Li, Zhiao Luo, Jian’an Ye, Haoyu Liu, Jiangmiao Li, Zhiwei Zhong, He Wang, and Hao Su. Genesis: A generative and universal physics engine for robotics and beyond.arXiv preprint arXiv:2409.15824, 2024. 16