arxiv: 2603.15042 · v3 · submitted 2026-03-16 · 💻 cs.DC · cs.OS

Performance Isolation and Semantic Determinism in Efficient GPU Spatial Sharing

Pith reviewed 2026-05-15 10:30 UTC · model grok-4.3

classification 💻 cs.DC cs.OS
keywords GPU spatial sharing · performance isolation · semantic determinism · GPU coroutine · multi-tenant GPU · training throughput · inference latency · kernel semantics

The pith

CoGPU uses GPU coroutines to share GPUs spatially while preserving exact kernel semantics, isolation, and zero token mismatch.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

GPU spatial sharing has long forced a tradeoff: hardware partitioning wastes resources, hardware multiplexing causes interference between tenants, and kernel slicing alters floating-point reduction orders enough to change model outputs. CoGPU decouples logical virtual contexts from physical hardware through a new GPU coroutine abstraction. It performs lightweight cooperative migration to map immutable contexts onto mutable resources without changing kernel behavior or reduction sequences. This combination delivers high utilization, strong isolation between tenants, and absolute semantic determinism. Experiments show up to 79.2% higher training throughput than temporal sharing and 15.1% lower P99 inference latency, plus support for custom scheduling policies.
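To make the reduction-order point concrete, here is a minimal Python sketch of why regrouping a floating-point sum can change its result. It is an illustration only, not code from the paper: the function names `sum_in_one_pass` and `sum_in_slices` and the input values are assumptions chosen for the demonstration, and the CPU loop stands in for a GPU reduction.

```python
def sum_in_one_pass(values):
    """Accumulate left to right, as a single unsliced reduction would."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_in_slices(values, num_slices):
    """Accumulate each contiguous slice separately, then add the partial sums.

    Hypothetical stand-in for a sliced reduction: same inputs, different grouping.
    """
    size = (len(values) + num_slices - 1) // num_slices
    partials = [sum_in_one_pass(values[i:i + size])
                for i in range(0, len(values), size)]
    return sum_in_one_pass(partials)


# Illustrative inputs: large magnitudes absorb the small terms differently
# depending on how the additions are grouped.
values = [1e16, 1.0, -1e16, 1.0]
print(sum_in_one_pass(values))   # one grouping
print(sum_in_slices(values, 2))  # another grouping of the same numbers
```

The two calls add the same four numbers yet print different results, because floating-point addition is not associative. Any scheme that regroups a reduction has to account for this.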

Core claim

CoGPU resolves the three-way tradeoff in GPU spatial sharing by introducing GPU coroutines that enable dynamic mapping of immutable virtual contexts to mutable physical resources via lightweight cooperative migration, thereby achieving high utilization, strong performance isolation, and absolute semantic determinism that guarantees zero token mismatch across co-located workloads.

What carries the argument

GPU coroutine, an abstraction for logical-to-physical resource decoupling that uses lightweight cooperative migration to preserve exact kernel semantics and floating-point reduction orders.

Load-bearing premise

Lightweight cooperative migration between immutable virtual contexts and mutable physical resources can always preserve exact kernel semantics and floating-point reduction orders across diverse workloads without hidden interference or overhead.

What would settle it

Run the same generative model in isolation and then again while co-located with other workloads on CoGPU; any token output difference would disprove the zero-mismatch claim.
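The check described above can be sketched as a comparison of two token sequences. This is a hedged illustration: `first_token_mismatch` is a hypothetical helper, and the token IDs in the usage lines are made-up values rather than outputs from any real run.

```python
from itertools import zip_longest


def first_token_mismatch(isolated_tokens, colocated_tokens):
    """Return the index of the first differing token, or None if both runs agree.

    A length difference also counts as a mismatch, because zip_longest pads
    the shorter sequence with None.
    """
    pairs = zip_longest(isolated_tokens, colocated_tokens)
    for index, (isolated, colocated) in enumerate(pairs):
        if isolated != colocated:
            return index
    return None


# Made-up token IDs, for illustration only.
isolated_run = [12, 7, 99, 4]
colocated_run = [12, 7, 98, 31]
print(first_token_mismatch(isolated_run, colocated_run))  # index of first drift
print(first_token_mismatch(isolated_run, isolated_run))   # None: identical runs
```

Under greedy decoding, a single early mismatch is enough to make every later token diverge, so reporting the first differing index is usually more informative than a total mismatch count.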

Figures

Figures reproduced from arXiv: 2603.15042 by Haibo Chen, Mingyu Li, Wenxin Zheng, Zhenyuan Yang.

Figure 1
Figure 1: A comparison of existing GPU sharing approaches. • Hardware Partitioning (e.g., MIG [15]): Static and coarse-grained resource boundaries trap compute capacity within isolated silos, leaving massive residual resources underutilized (failing O1). • Hardware Multiplexing (e.g., NVIDIA MPS [14]): A lack of fine-grained control to dynamically prioritize latency-critical requests leads to unpredictable perform…
Figure 3
Figure 3: The cascading effect of numerical deviations in LLM auto-regressive decoding (Greedy Sampling). … the SMs are saturated by opaque, non-preemptible background kernels, destroying performance isolation [53]. • Software is Constrained (SLO-Aware): Conversely, while software frameworks can perceive application SLOs, they are bottlenecked by the rigid coupling between logical execution and physical resources. On…
Figure 4
Figure 4: The overall architecture of DetShare. The remainder of this section unpacks how DetShare turns this philosophy into a practical system. We first introduce the GPU Coroutine abstraction, the foundation for our logical-physical separation (§4.1). Next, we detail the runtime mechanisms for dynamic binding and cooperative preemption (§4.2). To prevent migration overheads from eroding performance, we then pres…
Figure 5
Figure 5: An example of Remapping Event and pCtx migration. … interrupts, DetShare transparently injects a lightweight Resident Control Kernel (RCK) into each active pCtx. Operating as a device-side signal handler, the RCK is a persistent, single-threaded kernel that consumes negligible resources (<0.1% SM occupancy). When the global scheduler dictates preemption, it asserts a flag in a shared memory region mapped to…
Figure 6
Figure 6: Normalized throughput of colocated model training tasks (higher is better). DetShare consistently outperforms baselines, particularly in scenarios with high resource contention or varying interference patterns. … consistently achieves superior aggregate throughput by effectively balancing SM utilization and minimizing inter-job interference. High Compute Contention (Configs A–D). These configurations repre…
Figure 8
Figure 8: Semantic Determinism Evaluation. DetShare perfectly preserves numerical outputs and guarantees zero token drift under multi-tenant interference. 6.2 Semantic Determinism Evaluation Experimental Setup. To rigorously evaluate whether DetShare can preserve the exact outputs of LLM inference in multi-tenant environments, we construct a microbenchmark isolating the LM Head (GEMM) and Softmax layers. These two…
Figure 7
Figure 7: Performance comparison on mixed workloads. (a) DNN training throughput (higher is better) and (b) LLM inference p99 latency (lower is better) under 4 configurations. DetShare balances both efficiency and SLOs. Guaranteeing Inference SLOs. For latency-critical LLM workloads, DetShare achieves the lowest 99th percentile (p99) latency across all configurations. In Config A, DetShare reduces the p99 latency t…
Figure 9
Figure 9: End-to-end evaluation on Azure, LongBench, and BurstGPT traces: Throughput (higher is better), Latency Breakdown and SLO Violations (lower is better). DetShare reduces average and tail latency while maintaining throughput. Adopting the TPOT-First strategy further decreases decoding tail latency. … custom scheduling policies can be seamlessly integrated to meet diverse production demands. 6.3.1 Experimental S…
Figure 10
Figure 10: Overhead analysis. DetShare incurs only 4% and 12% overhead for context switching and preemption, respectively, normalized to the exclusive execution. Analysis. We attribute this to a necessary scheduling tradeoff. Under extreme burstiness, the TPOT-First policy proactively delays the scheduling of new prefill requests (increasing TTFT) to reserve compute and memory bandwidth for ongoing decoding phase…
Original abstract

Existing GPU spatial sharing systems face a three-way tradeoff: resource utilization, performance isolation, and semantic determinism. Hardware partitioning suffers from hardware under-utilization. Hardware multiplexing fails to avoid performance interference. Recently proposed software-based GPU kernel slicing reshapes floating-point reduction orders, destroying semantic determinism and inducing catastrophic token drift in generative models. We present CoGPU, a transparent spatial sharing system that resolves this trilemma. CoGPU introduces GPU coroutine, a novel abstraction that enables logical-to-physical resource decoupling. By dynamically mapping immutable virtual contexts to mutable physical resource via lightweight cooperative migration, CoGPU enables extensible, workload-aware scheduling without altering kernel semantics. Evaluations demonstrate CoGPU simultaneously achieves high utilization, strong isolation, and absolute semantic determinism (guaranteeing zero token mismatch). In multi-tenant co-location, it improves training throughput by up to 79.2% over temporal sharing and reduces P99 inference tail latency by 15.1%. Its pluggable architecture supports custom policies; compared to the default policy, a TPOT-FIRST policy further reduces SLO violations by 21.2% under dynamic traffic.
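As a rough illustration of what a pluggable policy could look like, the Python sketch below contrasts an arrival-order policy with one that favors decoding work. Everything in it is an assumption made for illustration: the `Request` type, its fields, and both policy functions are hypothetical, and the source only states that the architecture is pluggable and that a TPOT-FIRST policy exists, not how either is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical pending work item; the field names are illustrative."""
    name: str
    phase: str      # "prefill" or "decode"
    arrival: float  # arrival time in seconds


def fifo_policy(pending):
    """Assumed baseline: serve strictly in arrival order."""
    return sorted(pending, key=lambda r: r.arrival)


def tpot_first_policy(pending):
    """Assumed TPOT-first ordering: ongoing decodes run before new prefills.

    The sort key puts decode requests first (False sorts before True),
    then breaks ties by arrival time.
    """
    return sorted(pending, key=lambda r: (r.phase != "decode", r.arrival))


pending = [
    Request("p1", "prefill", 0.0),
    Request("d1", "decode", 1.0),
    Request("d2", "decode", 0.5),
]
print([r.name for r in fifo_policy(pending)])
print([r.name for r in tpot_first_policy(pending)])
```

Because a policy here is just a function from pending requests to an ordering, swapping policies does not touch the execution path. Deferring prefills in this way lengthens time to first token in exchange for steadier time per output token.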

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces CoGPU, a transparent GPU spatial sharing system that uses a novel GPU coroutine abstraction to decouple immutable virtual contexts from mutable physical resources via lightweight cooperative migration. This is claimed to simultaneously deliver high utilization, strong performance isolation, and absolute semantic determinism (zero token mismatch) while improving training throughput by up to 79.2% over temporal sharing and reducing P99 inference tail latency by 15.1%.

Significance. If the core claims on semantic preservation hold, the work would meaningfully advance multi-tenant GPU scheduling for ML workloads by addressing the utilization-isolation-determinism trilemma without hardware changes or kernel modifications.

major comments (1)
  1. [Abstract] The central guarantee of absolute semantic determinism and zero token mismatch rests on the unverified claim that cooperative migration of virtual contexts to physical resources always preserves exact kernel semantics, warp scheduling, memory interleaving, and floating-point reduction orders. No invariant, formal argument, or coverage of reduction-heavy kernels (e.g., attention or GEMM reductions) is supplied to support this.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our semantic determinism claims. We clarify the preservation mechanism enabled by the GPU coroutine abstraction and commit to strengthening the manuscript with additional formal arguments and targeted evaluations on reduction-heavy kernels.

Point-by-point responses
  1. Referee: [Abstract] The central guarantee of absolute semantic determinism and zero token mismatch rests on the unverified claim that cooperative migration of virtual contexts to physical resources always preserves exact kernel semantics, warp scheduling, memory interleaving, and floating-point reduction orders. No invariant, formal argument, or coverage of reduction-heavy kernels (e.g., attention or GEMM reductions) is supplied to support this.

    Authors: We appreciate this observation. CoGPU's GPU coroutine captures the full immutable virtual context (registers, memory mappings, program counters) at cooperative yield points chosen to be semantically neutral. Migration remaps this context to new physical resources while preserving the exact logical execution sequence, warp scheduling order, and memory interleaving as observed by the kernel code. Consequently, floating-point reduction orders in kernels such as attention and GEMM remain identical to non-shared execution because data dependencies and operation sequences are unchanged by the physical remapping. In the revised version we will add (1) an explicit invariant stating that cooperative migration preserves kernel-visible state and ordering, (2) a short formal argument based on the immutability of the virtual context, and (3) new experiments measuring token mismatch on attention and GEMM workloads under co-location. These additions directly address the lack of coverage and verification noted.
    Revision: yes
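The invariant the rebuttal describes can be pictured with a toy model in which a Python generator plays the role of a coroutine. This is a sketch of the idea, not the paper's mechanism: `reduction_kernel`, `run_with_migrations`, and the slot names are hypothetical, and a CPU generator captures only the logical ordering argument, not real GPU state such as warp scheduling or memory interleaving.

```python
def reduction_kernel(values):
    """Toy 'kernel': adds values in a fixed order, pausing after each step.

    Each yield models a cooperative yield point. The running total and loop
    position live inside the generator, independent of whoever resumes it.
    """
    total = 0.0
    for v in values:
        total += v
        yield total


def run_with_migrations(values, slot_schedule):
    """Drive one kernel to completion, attributing each step to a slot.

    slot_schedule is a non-empty list of hypothetical physical slot names;
    step i is attributed to slot_schedule[i % len(slot_schedule)].
    """
    trace = []
    result = None
    for step, partial in enumerate(reduction_kernel(values)):
        trace.append(slot_schedule[step % len(slot_schedule)])
        result = partial
    return result, trace


values = [1e16, 1.0, -1e16, 1.0]
pinned_result, pinned_trace = run_with_migrations(values, ["sm0"])
moved_result, moved_trace = run_with_migrations(values, ["sm0", "sm3", "sm1"])
print(pinned_result == moved_result)  # same logical sequence, same value
print(pinned_trace != moved_trace)    # different physical placement
```

In this toy the result cannot depend on placement, because the slot name never enters the arithmetic. The referee's question is whether the same separation holds for everything a real kernel can observe.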

Circularity Check

0 steps flagged

No circularity; claims rest on novel system design and empirical results

Full rationale

The paper introduces GPU coroutines as a new abstraction for logical-to-physical decoupling via cooperative migration, asserting that this preserves kernel semantics by construction of the mechanism. Performance numbers (79.2% throughput, 15.1% latency reduction) are presented as direct evaluation outcomes rather than predictions derived from fitted parameters or self-referential equations. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing premises in the abstract or described chain. The central trilemma resolution is framed as an engineering outcome of the proposed mapping, not reduced to its inputs by definition or renaming. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim depends on the correctness of the new GPU coroutine abstraction and its ability to maintain semantic equivalence during migration; no free parameters are mentioned, and the approach relies on standard assumptions about GPU execution semantics.

axioms (1)
  • [domain assumption] GPU kernels have deterministic semantics that remain invariant under cooperative context migration between virtual and physical resources.
    Invoked to guarantee zero token mismatch and semantic determinism.
invented entities (1)
  • GPU coroutine (no independent evidence)
    purpose: Abstraction enabling logical-to-physical resource decoupling via lightweight cooperative migration.
    Newly introduced mechanism to achieve the claimed trilemma resolution.

Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · 7 internal anchors

  1. [1]

    2025. CUDA Runtime API :: CUDA Toolkit Documentation. https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html. (Accessed on 01/12/2025)

  2. [2]

    A World-Wide Leading AI Company Infrastructure Team. 2026. Private Communication regarding Production GPU Sharing Constraints. Personal Communication. Unpublished industry insights

  3. [3]

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 117–134

  4. [4]

    Rachata Ausavarungnirun, Vance Miller, Joshua Landgraf, Saugata Ghose, Jayneel Gandhi, Adwait Jog, Christopher J Rossbach, and Onur Mutlu. 2018. Mask: Redesigning the gpu memory hierarchy to support multi-application concurrency. ACM SIGPLAN Notices 53, 2 (2018), 503–518

  5. [5]

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2024. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL)

  6. [6]

    Zhihao Bai, Zhen Zhang, Yibo Zhu, and Xin Jin. 2020. PipeSwitch: Fast pipelined context switching for deep learning applications. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 499–514

  7. [7]

    Zhihao Bai, Zhen Zhang, Yibo Zhu, and Xin Jin. 2020. PipeSwitch: fast pipelined context switching for deep learning applications. In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation (OSDI’20). USENIX Association, USA, Article 28, 16 pages

  8. [8]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

  9. [9]

    Guoyu Chen, Srinivasan Subramaniyan, and Xiaorui Wang. 2024. Latency-Guaranteed Co-Location of Inference and Training for Reducing Data Center Expenses. In 2024 IEEE 44th International Conference on Distributed Computing Systems (ICDCS). 473–484. doi:10.1109/ICDCS60910.2024.00051

  10. [10]

    Quan Chen, Hailong Yang, Jason Mars, and Lingjia Tang. 2016. Baymax: QoS Awareness and Increased Utilization for Non-Preemptive Accelerators in Warehouse Scale Computers. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems (Atlanta, Georgia, USA) (ASPLOS ’16). Association f...

  11. [11]

    KyungWoon Cho and Hyokyung Bahn. 2020. Performance Analysis of Thread Block Schedulers in GPGPU and Its Implications. Applied Sciences 10, 24 (2020). doi:10.3390/app10249121

  12. [12]

    Seungbeom Choi, Sunho Lee, Yeonjae Kim, Jongse Park, Youngjin Kwon, and Jaehyuk Huh. 2022. Serving heterogeneous machine learning models on Multi-GPU servers with Spatio-Temporal sharing. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). 199–216

  13. [13]

    Patrick H Coppock, Brian Zhang, Eliot H Solomon, Vasilis Kypriotis, Leon Yang, Bikash Sharma, Dan Schatzberg, Todd C Mowry, and Dimitrios Skarlatos. 2025. LithOS: An operating system for efficient machine learning on GPUs. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles. 1–17

  14. [14]

    NVIDIA Corporation. 2025. CUDA Multi-Process Service (MPS) Overview. https://docs.nvidia.com/deploy/mps/ Describes the MPS client-server model that multiplexes multiple processes into a single CUDA context to reduce context-switch overhead and enable concurrent kernel execution

  15. [15]

    NVIDIA Corporation. 2025. NVIDIA Multi-Instance GPU (MIG) User Guide. https://docs.nvidia.com/datacenter/tesla/mig-user-guide/ Describes GPU partitioning into multiple isolated GPU instances with dedicated compute, cache, and memory resources, enabling spatial sharing with strong isolation

  16. [16]

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

  17. [17]

    FLASHATTENTION: fast and memory-efficient exact attention with IO-awareness. In Proceedings of the 36th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS ’22). Curran Associates Inc., Red Hook, NY, USA, Article 1189, 16 pages

  18. [18]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova

  19. [19]

    Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). 4171–4186

  20. [20]

    Aditya Dhakal, Sameer G Kulkarni, and K. K. Ramakrishnan. 2020. GSLICE: controlled spatial sharing of GPUs for a scalable inference platform. In Proceedings of the 11th ACM Symposium on Cloud Computing (Virtual Event, USA) (SoCC ’20). Association for Computing Machinery, New York, NY, USA, 492–506. doi:10.1145/3419111.3421284

  21. [21]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kulkarni, Gaurav Goel, Kanshul Nguyen, Punit Kulkarni, et al. 2024. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783 (2024)

  22. [22]

    Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. 2024. ServerlessLLM: Low-Latency Serverless Inference for Large Language Models. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 135–153. https://www.usenix.org/conference/osdi24/pr...

  23. [23]

    Gartner. 2025. Gartner Says AI-Optimized IaaS Is Poised to Become the Next Growth Engine for AI Infrastructure. https://www.gartner.com/en/newsroom/press-releases/2025-10-15-gartner-says-artificial-intelligence-optimized-iaas-is-poised-to-become-the-next-growth-engine-for-artificial-intelligence-infrastructure Accessed: 2025-11-28

  24. [24]

    Guin Gilman, Samuel S Ogden, Tian Guo, and Robert J Walls. 2021. Demystifying the placement policies of the NVIDIA GPU thread block scheduler for concurrent kernels. ACM SIGMETRICS Performance Evaluation Review 48, 3 (2021), 81–88

  25. [25]

    GLM-4 Team and Zhipu AI. 2024. GLM-4: Towards Open Source Language Models for Academic Research. arXiv preprint arXiv:2406.12793 (2024)

  26. [26]

    David Goldberg. 1991. What every computer scientist should know about floating-point arithmetic. ACM computing surveys (CSUR) 23, 1 (1991), 5–48

  27. [27]

    Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, and Jonathan Mace. 2020. Serving DNNs like clockwork: performance predictability from the bottom up. In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation (OSDI’20). USENIX Association, USA, Article 25, 20 pages

  28. [28]

    Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI). 1725–1731

  29. [29]

    Bing-Shiun Han, Tathagata Paul, Zhenhua Liu, and Anshul Gandhi

  30. [30]

    KACE: Kernel-Aware Colocation for Efficient GPU Spatial Sharing. In Proceedings of the 2024 ACM Symposium on Cloud Computing (Redmond, WA, USA) (SoCC ’24). Association for Computing Machinery, New York, NY, USA, 460–469. doi:10.1145/3698038.3698555

  31. [31]

    Mingcong Han, Hanze Zhang, Rong Chen, and Haibo Chen. 2022. Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, Carlsbad, CA, 539–558. https://www.usenix.org/conference/osdi22/presentation/han

  32. [32]

    Sayed Hadi Hashemi, Sangeetha Abdu Jyothi, and Roy Campbell

  33. [33]

    TicTac: Accelerating Distributed Deep Learning with Communication Scheduling. In Proceedings of Machine Learning and Systems, A. Talwalkar, V. Smith, and M. Zaharia (Eds.), Vol. 1. 418–430. https://proceedings.mlsys.org/paper_files/paper/2019/file/94cb28874a503f34b3c4a41bddcea2bd-Paper.pdf

  34. [34]

    Horace He and Thinking Machines Lab. 2025. Defeating Nondeterminism in LLM Inference. Thinking Machines Lab: Connectionism (2025). doi:10.64434/tml.20250910 https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/

  35. [35]

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). 770–778

  36. [36]

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations (ICLR)

  37. [37]

    Wenhao Huang, Zhaolin Duan, Laiping Zhao, Yuhao Zhang, Yanjie Wang, Yiming Li, Yihan Wang, Yichi Chen, Zhihang Tang, Kang Chen, Deze Zeng, Wenxin Li, and Keqiu Li. 2026. µShare: Non-Intrusive Kernel Co-Locating on NVIDIA GPUs. 2026 IEEE International Symposium on High Performance Computer Architecture (HPCA) (2026), 1–14. https://api.semanticscholar.org/Co...

  38. [38]

    Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, and Fan Yang. 2019. Analysis of large-scale multi-tenant GPU clusters for DNN training workloads. In 2019 USENIX Annual Technical Conference (USENIX ATC 19). 947–960

  39. [39]

    Jaehoon Jung, Jinpyo Kim, and Jaejin Lee. 2023. Deepum: Tensor migration and prefetching in unified memory. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 207–221

  40. [40]

    Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, and Ashish Panwar. 2025. Pod-attention: Unlocking full prefill-decode overlap for faster llm inference. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 897–912

  41. [41]

    Yueying Li, Zhanqiu Hu, Esha Choukse, Rodrigo Fonseca, G Edward Suh, and Udit Gupta. 2025. Ecoserve: Designing carbon-aware ai inference systems. arXiv preprint arXiv:2502.05043 (2025)

  42. [42]

    Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). USENIX Association, Boston, MA...

  43. [43]

    Jaiaid Mobin, Avinash Maurya, and M Mustafa Rafique. 2023. COLTI: Towards Concurrent and Co-located DNN Training and Inference. In Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing. 309–310

  44. [44]

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. 2021. Efficient large-scale language model training on GPU clusters using megatron-LM. In Proceedings of the International Conference for ...

  45. [45]

    Kelvin K. W. Ng, Henri Maxime Demoulin, and Vincent Liu. 2023. Paella: Low-latency Model Serving with Software-defined GPU Scheduling. In Proceedings of the 29th Symposium on Operating Systems Principles (Koblenz, Germany) (SOSP ’23). Association for Computing Machinery, New York, NY, USA, 595–610. doi:10.1145/3600006.3613163

  46. [46]

    NVIDIA. 2025. CUDA driver API – Green Contexts. https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__GREEN__CONTEXTS.html Accessed: 2025-12-10

  47. [47]

    NVIDIA Corporation. 2024. cuBLAS Library Documentation. https://docs.nvidia.com/cuda/cublas/index.html Accessed: 2026-03-26

  48. [48]

    NVIDIA Corporation. 2024. CUDA C++ Programming Guide. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html Accessed: 2024

  49. [49]

    OpenAI. 2023. ChatGPT. https://chat.openai.com. Accessed: 2025-11-28

  50. [50]

    Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke. 2015. Chimera: Collaborative Preemption for Multitasking on a Shared GPU. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (Istanbul, Turkey) (ASPLOS ’15). Association for Computing Machinery, New York, NY, USA, 593–606. doi...

  51. [51]

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2025. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. In Proceedings of the 51st Annual International Symposium on Computer Architecture (Buenos Aires, Argentina) (ISCA ’24). IEEE Press, 118–132. doi:10.1109/ISCA59077.2024.00019

  52. [52]

    Manos Pavlidakis, Giorgos Vasiliadis, Stelios Mavridis, Anargyros Argyros, Antony Chazapis, and Angelos Bilas. 2024. Guardian: Safe GPU Sharing in Multi-Tenant Environments. In Proceedings of the 25th International Middleware Conference (Hong Kong, Hong Kong) (Middleware ’24). Association for Computing Machinery, New York, NY, USA, 313–326. doi:10.1145/3...

  53. [53]

    Yanghua Peng, Yibo Zhu, Yangrui Chen, Yixin Bao, Bairen Yi, Chang Lan, Chuan Wu, and Chuanxiong Guo. 2019. ByteScheduler: A generic communication scheduler for distributed DNN training acceleration. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP ’19). 516–529

  54. [54]

    Aleksei Petrenko, Ben Lipkin, Kevin Chen, Erik Wijmans, Marco Cusumano-Towner, Raja Giryes, and Philipp Krähenbühl. 2026. Entropy-Preserving Reinforcement Learning. In International Conference on Learning Representations (ICLR)

  55. [55]

    David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1986. Learning representations by back-propagating errors. Nature 323, 6088 (1986), 533–536

  56. [56]

    SGLang Contributors. 2025. Deterministic Inference — SGLang Documentation. https://docs.sglang.io/advanced_features/deterministic_inference.html. Accessed: 2026-03-26

  57. [57]

    Weihang Shen, Mingcong Han, Jialong Liu, Rong Chen, and Haibo Chen. 2025. XSched: preemptive scheduling for diverse XPUs. In Proceedings of the 19th USENIX Conference on Operating Systems Design and Implementation (Boston, MA, USA) (OSDI ’25). USENIX Association, USA, Article 37, 22 pages

  58. [58]

    Dharma Shukla, Muthian Sivathanu, Srinidhi Viswanatha, Bhargav Gulavani, Rimma Nehme, Amey Agrawal, Chen Chen, Nipun Kwatra, Ramachandran Ramjee, Pankaj Sharma, Atul Katiyar, Vipul Modi, Vaibhav Sharma, Abhishek Singh, Shreshth Singhal, Kaustubh Welankar, Lu Xun, Ravi Anupindi, Karthik Elangovan, Hasibur Rehman, Zhou Lin, Rahul Seetharaman, Cheng Xu, Ed...

  59. [59]

    Tom Sorensen and Bob Sorensen. 2024. Cloud-based AI Activity for HPC: Widespread but Primarily Exploratory. Technical Report HR4.0492.09.20.2024. Hyperion Research. https://hyperionresearch.com/wp-content/uploads/2024/09/Hyperion-Research-Special-Report-AI-in-the-Cloud-September-2024.pdf Accessed: 2025-11-28

  60. [60]

    Peter Steinberger and OpenClaw Contributors. 2025. OpenClaw: Personal AI Assistant

  61. [61]

    Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, and Esha Choukse. 2025. DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency. In Proceedings of the 31st IEEE International Symposium on High-Performance Computer Architecture (HPCA). Best Paper Award

  62. [62]

    Foteini Strati, Xianzhe Ma, and Ana Klimovic. 2024. Orion: Interference-aware, fine-grained gpu sharing for ml applications. In Proceedings of the Nineteenth European Conference on Computer Systems. 1075–1092

  63. [63]

    Ivan Tanasic, Isaac Gelado, Javier Cabezas, Alex Ramirez, Nacho Navarro, and Mateo Valero. 2014. Enabling preemptive multiprogramming on GPUs. In Proceeding of the 41st Annual International Symposium on Computer Architecuture (Minneapolis, Minnesota, USA) (ISCA ’14). IEEE Press, 193–204

  64. [64]

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

  65. [65]

    Qwen Team. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https://arxiv.org/abs/2505.09388

  66. [66]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971 [cs.CL] https://arxiv.org/abs/2302.13971

  67. [67]

    Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax. EleutherAI

  68. [68]

    Guanhua Wang, Kehan Wang, Kenan Jiang, Xiangjun Li, and Ion Stoica. 2021. Wavelet: Efficient DNN training with tick-tock scheduling. Proceedings of Machine Learning and Systems 3 (2021), 696–710

  69. [69]

    Yuxin Wang, Yibo Chen, Zhaozhu Li, Xinyu Kang, Yinan Fang, Yangtian Zhou, Yujie Zheng, Zhennan Tang, Xiuming He, Rong Guo, Xin Wang, Qiang Wang, Aoying Zhou, and Xiaowen Chu. 2025. BurstGPT: A Real-World Workload Dataset to Optimize LLM Serving Systems. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)

  70. [70]

    Xingda Wei, Zhuobin Huang, Tianle Sun, Yingyi Hao, Rong Chen, Mingcong Han, Jinyu Gu, and Haibo Chen. 2025. PhoenixOS: Concurrent OS-level GPU Checkpoint and Restore with Validated Speculation. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles (Lotte Hotel World, Seoul, Republic of Korea) (SOSP ’25). Association for Computin...

  71. [71]

    Qizhen Weng, Wencong Xiao, Yinghao Yu, Wei Wang, Cheng Wang, Jian He, Yong Li, Liping Zhang, Wei Lin, and Yu Ding. 2022. MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). USENIX Association, Renton, WA, 945–960. https://www.useni...

  72. [72]

    Bingyang Wu, Zili Zhang, Zhihao Bai, Xuanzhe Liu, and Xin Jin. 2023. Transparent GPU sharing in container clouds for deep learning workloads. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). 69–85

  73. [73]

    Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, et al. 2018. Gandiva: Introspective cluster scheduling for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 595–610

  74. [74]

    Wencong Xiao, Shiru Ren, Yong Li, Yang Zhang, Pengyang Hou, Zhi Li, Yihui Feng, Wei Lin, and Yangqing Jia. 2020. AntMan: Dynamic scaling on GPU clusters for deep learning. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 533–548

  75. [75]

    Peichen Xie, Yang Wang, Fan Yang, and Mao Yang. 2025. MMA-Sim: Bit-Accurate Reference Model of Tensor Cores and Matrix Cores. arXiv:2511.10909 [cs.AR] https://arxiv.org/abs/2511.10909

  76. [76]

    Peichen Xie, Xian Zhang, and Shuo Chen. 2025. RepDL: Bit-level Reproducible Deep Learning Training and Inference. arXiv:2510.09180 [cs.LG] https://arxiv.org/abs/2510.09180

  77. [77]

    Qiumin Xu, Hyeran Jeon, Keunsoo Kim, Won Woo Ro, and Murali Annavaram. 2016. Warped-slicer: efficient intra-SM slicing through dynamic resource partitioning for GPU multiprogramming. In Proceedings of the 43rd International Symposium on Computer Architecture (Seoul, Republic of Korea) (ISCA ’16). IEEE Press, 230–242. doi:10.1109/ISCA.2016.29

  78. [78]

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 521–538

  79. [79]

    Peifeng Yu and Mosharaf Chowdhury. 2020. Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications. In Proceedings of the 3rd MLSys Conference (MLSys). Austin, TX, USA

  80. [80]

    Anwar Hossain Zahid, Ignacio Laguna, and Wei Le. 2025. Testing GPU Numerics: Finding Numerical Differences Between NVIDIA and AMD GPUs. In Proceedings of the SC ’24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis (Atlanta, GA, USA) (SC-W ’24). IEEE Press, 547–557. doi:10.1109/SCW63240.2024.00077
