Performance Isolation and Semantic Determinism in Efficient GPU Spatial Sharing
Pith reviewed 2026-05-15 10:30 UTC · model grok-4.3
The pith
CoGPU uses GPU coroutines to share GPUs spatially while preserving exact kernel semantics, isolation, and zero token mismatch.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoGPU resolves the three-way tradeoff in GPU spatial sharing by introducing GPU coroutines that enable dynamic mapping of immutable virtual contexts to mutable physical resources via lightweight cooperative migration, thereby achieving high utilization, strong performance isolation, and absolute semantic determinism that guarantees zero token mismatch across co-located workloads.
What carries the argument
GPU coroutine, an abstraction for logical-to-physical resource decoupling that uses lightweight cooperative migration to preserve exact kernel semantics and floating-point reduction orders.
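To make the decoupling idea concrete, here is a toy sketch in Python. It is not CoGPU's mechanism: a generator stands in for a cooperatively yielding kernel, and `virtual_kernel`, `run`, and the integer "slot" ids are hypothetical names invented for illustration. The point it shows is narrow: if the logical state (accumulator and loop position) lives only in the virtual context, then changing the physical slot at each yield point cannot change the order of arithmetic.

```python
from typing import Generator, List, Sequence, Tuple


def virtual_kernel(values: Sequence[float]) -> Generator[int, int, float]:
    """Toy virtual context: the accumulator and loop index are its only state.

    After each accumulation step it yields the step index (a cooperative
    yield point) and receives the id of the physical slot it resumes on.
    The slot id is deliberately never used in the arithmetic.
    """
    acc = 0.0
    for step, value in enumerate(values):
        acc += value
        _slot = yield step  # hypothetical migration point
    return acc


def run(values: Sequence[float], placement: Sequence[int]) -> Tuple[float, List[int]]:
    """Drive the kernel, resuming it on placement[i % len(placement)] each time."""
    kernel = virtual_kernel(values)
    trace: List[int] = []
    try:
        next(kernel)  # run up to the first yield point
        resume = 0
        while True:
            slot = placement[resume % len(placement)]
            trace.append(slot)   # record where the context was resumed
            kernel.send(slot)    # "migrate" and continue
            resume += 1
    except StopIteration as done:
        return done.value, trace
```

Calling `run` with the same `values` and two different `placement` lists yields different traces but the same result, because the sequence of floating-point additions is fixed by the generator alone. Whether a real GPU offers an equally clean separation is exactly what the review below questions.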
Load-bearing premise
Lightweight cooperative migration between immutable virtual contexts and mutable physical resources can always preserve exact kernel semantics and floating-point reduction orders across diverse workloads without hidden interference or overhead.
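The premise matters because floating-point addition is not associative, so any change in reduction order can change the result. The following Python sketch illustrates that general fact only; it does not model CoGPU or any specific kernel-slicing system, and `sequential_sum` and `sliced_sum` are hypothetical helper names.

```python
from typing import List, Sequence


def sequential_sum(values: Sequence[float]) -> float:
    """Reduce left to right, one element at a time."""
    total = 0.0
    for value in values:
        total += value
    return total


def sliced_sum(values: Sequence[float], slice_size: int) -> float:
    """Reduce each fixed-size slice separately, then combine the partial sums.

    This mimics, very loosely, what happens when a reduction is split into
    independently executed pieces: the grouping of additions changes.
    """
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    partials: List[float] = [
        sequential_sum(values[start:start + slice_size])
        for start in range(0, len(values), slice_size)
    ]
    return sequential_sum(partials)


# Inputs chosen so that small terms are absorbed differently per grouping.
EXAMPLE = [1e16, 1.0, -1e16, 1.0]
```

With `EXAMPLE`, `sequential_sum(EXAMPLE)` and `sliced_sum(EXAMPLE, 2)` disagree in IEEE 754 double precision, even though both compute "the sum" of the same four numbers. In a generative model, a difference of this kind in one logit can flip the selected token, and later tokens then diverge.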
What would settle it
Run the same generative model in isolation and then again while co-located with other workloads on CoGPU; any token output difference would disprove the zero-mismatch claim.
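A minimal sketch of that check in Python, assuming the two runs have already produced token-id sequences under deterministic decoding (for example greedy decoding with identical inputs). The function name `first_token_mismatch` and the list-of-integers representation are assumptions made for illustration, not part of the paper.

```python
from typing import Optional, Sequence


def first_token_mismatch(
    isolated: Sequence[int], colocated: Sequence[int]
) -> Optional[int]:
    """Return the index of the first differing token, or None if identical.

    A length difference counts as a mismatch at the first position where
    one sequence has a token and the other does not.
    """
    for index, (a, b) in enumerate(zip(isolated, colocated)):
        if a != b:
            return index
    if len(isolated) != len(colocated):
        return min(len(isolated), len(colocated))
    return None
```

A return value of `None` over many prompts is consistent with the zero-mismatch claim but cannot prove it; a single non-`None` result is enough to refute it.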
Original abstract
Existing GPU spatial sharing systems face a three-way tradeoff: resource utilization, performance isolation, and semantic determinism. Hardware partitioning suffers from hardware under-utilization. Hardware multiplexing fails to avoid performance interference. Recently proposed software-based GPU kernel slicing reshapes floating-point reduction orders, destroying semantic determinism and inducing catastrophic token drift in generative models. We present CoGPU, a transparent spatial sharing system that resolves this trilemma. CoGPU introduces GPU coroutine, a novel abstraction that enables logical-to-physical resource decoupling. By dynamically mapping immutable virtual contexts to mutable physical resource via lightweight cooperative migration, CoGPU enables extensible, workload-aware scheduling without altering kernel semantics. Evaluations demonstrate CoGPU simultaneously achieves high utilization, strong isolation, and absolute semantic determinism (guaranteeing zero token mismatch). In multi-tenant co-location, it improves training throughput by up to 79.2% over temporal sharing and reduces P99 inference tail latency by 15.1%. Its pluggable architecture supports custom policies; compared to the default policy, a TPOT-FIRST policy further reduces SLO violations by 21.2% under dynamic traffic.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CoGPU, a transparent GPU spatial sharing system that uses a novel GPU coroutine abstraction to decouple immutable virtual contexts from mutable physical resources via lightweight cooperative migration. This is claimed to simultaneously deliver high utilization, strong performance isolation, and absolute semantic determinism (zero token mismatch) while improving training throughput by up to 79.2% over temporal sharing and reducing P99 inference tail latency by 15.1%.
Significance. If the core claims on semantic preservation hold, the work would meaningfully advance multi-tenant GPU scheduling for ML workloads by addressing the utilization-isolation-determinism trilemma without hardware changes or kernel modifications.
major comments (1)
- [Abstract] Abstract: The central guarantee of absolute semantic determinism and zero token mismatch rests on the unverified claim that cooperative migration of virtual contexts to physical resources always preserves exact kernel semantics, warp scheduling, memory interleaving, and floating-point reduction orders. No invariant, formal argument, or coverage of reduction-heavy kernels (e.g., attention or GEMM reductions) is supplied to support this.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our semantic determinism claims. We clarify the preservation mechanism enabled by the GPU coroutine abstraction and commit to strengthening the manuscript with additional formal arguments and targeted evaluations on reduction-heavy kernels.
Point-by-point responses
- Referee: [Abstract] Abstract: The central guarantee of absolute semantic determinism and zero token mismatch rests on the unverified claim that cooperative migration of virtual contexts to physical resources always preserves exact kernel semantics, warp scheduling, memory interleaving, and floating-point reduction orders. No invariant, formal argument, or coverage of reduction-heavy kernels (e.g., attention or GEMM reductions) is supplied to support this.
  Authors: We appreciate this observation. CoGPU's GPU coroutine captures the full immutable virtual context (registers, memory mappings, program counters) at cooperative yield points chosen to be semantically neutral. Migration remaps this context to new physical resources while preserving the exact logical execution sequence, warp scheduling order, and memory interleaving as observed by the kernel code. Consequently, floating-point reduction orders in kernels such as attention and GEMM remain identical to non-shared execution because data dependencies and operation sequences are unchanged by the physical remapping. In the revised version we will add (1) an explicit invariant stating that cooperative migration preserves kernel-visible state and ordering, (2) a short formal argument based on the immutability of the virtual context, and (3) new experiments measuring token mismatch on attention and GEMM workloads under co-location. These additions directly address the lack of coverage and verification noted.
  Revision: yes
Circularity Check
No circularity; claims rest on novel system design and empirical results
The paper introduces GPU coroutines as a new abstraction for logical-to-physical decoupling via cooperative migration, asserting that this preserves kernel semantics by construction of the mechanism. Performance numbers (79.2% throughput, 15.1% latency reduction) are presented as direct evaluation outcomes rather than predictions derived from fitted parameters or self-referential equations. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing premises in the abstract or described chain. The central trilemma resolution is framed as an engineering outcome of the proposed mapping, not reduced to its inputs by definition or renaming. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: GPU kernels have deterministic semantics that remain invariant under cooperative context migration between virtual and physical resources.
invented entities (1)
- GPU coroutine: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "By dynamically mapping immutable virtual contexts to mutable physical resource via lightweight cooperative migration, CoGPU enables extensible, workload-aware scheduling without altering kernel semantics."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability
  Tag: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "guaranteeing zero token mismatch... preserves the original thread-block scheduling order"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.