Recognition: no theorem link
Valve: Production Online-Offline Inference Colocation with Jointly-Bounded Preemption Latency and Rate
Pith reviewed 2026-05-10 18:00 UTC · model grok-4.3
The pith
Valve enables safe colocation of online and offline LLM inference on GPUs by jointly bounding preemption latency and rate.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Valve is a production-friendly colocation system that jointly bounds preemption latency and preemption rate. Specifically, it enables sub-millisecond compute preemption at most once per online request, and rate-limited sub-layer memory reclamation. These guarantees are provided by a GPU runtime that combines channel-controlled compute isolation, page-fault-free memory reclamation, and dynamic memory reservation. It requires only one line of driver modification and 20 lines of framework patch.
What carries the argument
A GPU runtime combining channel-controlled compute isolation, page-fault-free memory reclamation, and dynamic memory reservation, which jointly bounds preemption latency to sub-millisecond and preemption rate to at most one per online request.
If this is right
- Improves cluster utilization by 34.6%, equivalent to saving 2,170 GPUs in the deployed cluster.
- Incurs less than 5% increase in TTFT and less than 2% in TPOT across workloads.
- Enables deployment on 8,054 GPUs in production with minimal code changes.
- Supports colocation of different models while maintaining preemption guarantees.
Where Pith is reading between the lines
- Similar bounding techniques could extend to other hardware accelerators if analogous runtime primitives exist.
- Reducing preemption overhead this way may allow more aggressive colocation policies in future scheduling systems.
- The minimal modification requirement suggests Valve could be adopted broadly without major framework overhauls.
Load-bearing premise
The GPU runtime primitives for isolation, reclamation, and reservation can be implemented with one line of driver change and twenty lines of framework patch while maintaining correctness for all production models and drivers.
What would settle it
A measurement in production showing either preemption latency greater than one millisecond or more than one preemption per online request would falsify the joint bounding claim.
Figures
read the original abstract
LLM inference powers latency-critical production services nowadays. The bursty nature of inference traffic results in over-provisioning, which in turn leads to resource underutilization. While online-offline colocation promises to utilize idle capacity, broad production deployment must overcome two major challenges: (i) large online interference due to slow or frequent preemptions, and (ii) extensive frameworks and drivers modifications, to colocate different models and support preemptions. We present Valve, a production-friendly colocation system that jointly bounds preemption latency and preemption rate. Specifically, Valve enables sub-millisecond compute preemption at most once per online request, and rate-limited sub-layer memory reclamation. These guaranties are provided by a GPU runtime that combines channel-controlled compute isolation, page-fault-free memory reclamation, and dynamic memory reservation. Critically, Valve is practical to deploy, requiring one line of driver modification and 20 lines of framework patch. Deployed on 8,054 GPUs in production, Valve improves cluster utilization by 34.6%, which translates to a 2,170 GPU save. This efficiency gains is achieved with minimal online interference, incurring <5% TTFT increase and <2% TPOT increase across workloads.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Valve, a production system for online-offline colocation of LLM inference on GPUs. It claims to jointly bound preemption latency to sub-millisecond levels and preemption rate to at most once per online request via a GPU runtime using channel-controlled compute isolation, page-fault-free memory reclamation, and dynamic reservation. These guarantees are realized with only one line of driver modification and twenty lines of framework patch. In a production deployment across 8,054 GPUs, Valve reports a 34.6% cluster utilization improvement (saving 2,170 GPUs) while incurring <5% TTFT increase and <2% TPOT increase.
Significance. If the empirical results and minimal-modification claims hold under scrutiny, the work would offer a practically significant contribution to GPU cluster management for latency-sensitive LLM serving. The production-scale deployment and quantified savings address real underutilization from bursty traffic while preserving service-level objectives, potentially influencing how inference workloads are scheduled in large fleets.
major comments (3)
- [Abstract] Abstract: The headline result of 34.6% utilization improvement and 2,170 GPU savings on 8,054 GPUs is presented without any description of the baseline configuration, workload mix, or measurement methodology (e.g., how utilization was sampled, what the offline workload consisted of, or how interference was quantified under load). This absence makes the central empirical claim impossible to evaluate from the given text.
- [Abstract] Abstract: The assertion that channel-controlled isolation, page-fault-free reclamation, and dynamic reservation jointly bound preemption latency and rate while requiring only one driver line and twenty framework lines is stated at a high level. No concrete description of those changes, no argument that they preserve correctness under concurrent kernels or model heterogeneity, and no ablation showing the latency/rate bounds hold on the deployed workloads is supplied.
- [Abstract] Abstract: The interference figures (<5% TTFT, <2% TPOT) are given without reference to the specific online workloads, request rates, or how preemption events were triggered and measured during the production run, leaving the joint bounding claim unsupported by visible evidence.
minor comments (2)
- [Abstract] Abstract contains grammatical errors: 'guaranties' should be 'guarantees' and 'This efficiency gains is' should be 'This efficiency gain is'.
- [Abstract] The abstract would benefit from a brief sentence on the target GPU driver and framework versions to contextualize the minimal-patch claim.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the abstract. We agree that the abstract should be more self-contained to allow evaluation of the central claims without requiring the full text. We have revised the abstract to incorporate brief descriptions of the baseline, workload, measurement approach, mechanisms, and interference evaluation while preserving conciseness. The full manuscript (Sections 3–6) supplies the detailed arguments, ablations, and production traces referenced below.
read point-by-point responses
-
Referee: [Abstract] Abstract: The headline result of 34.6% utilization improvement and 2,170 GPU savings on 8,054 GPUs is presented without any description of the baseline configuration, workload mix, or measurement methodology (e.g., how utilization was sampled, what the offline workload consisted of, or how interference was quantified under load). This absence makes the central empirical claim impossible to evaluate from the given text.
Authors: We agree the abstract requires additional context. The baseline is a non-colocated deployment in which offline jobs execute only on GPUs that are idle after satisfying all online SLOs, with no preemption. The workload is production online inference traffic (bursty, 20–40% average GPU utilization) colocated with offline batch jobs using the same model families. Utilization was measured via DCGM counters sampled every 10 s over a 7-day production trace; the 34.6% gain is the reduction in total GPUs needed to meet the same TTFT/TPOT SLOs. We have updated the abstract to state: 'Compared with a non-colocated baseline that meets identical SLOs, Valve improves cluster utilization by 34.6% (saving 2,170 GPUs) on an 8,054-GPU fleet.' revision: yes
-
Referee: [Abstract] Abstract: The assertion that channel-controlled isolation, page-fault-free reclamation, and dynamic reservation jointly bound preemption latency and rate while requiring only one driver line and twenty framework lines is stated at a high level. No concrete description of those changes, no argument that they preserve correctness under concurrent kernels or model heterogeneity, and no ablation showing the latency/rate bounds hold on the deployed workloads is supplied.
Authors: The abstract summarizes the three primitives; Sections 3 and 4 of the manuscript describe them in detail. Channel-controlled isolation uses CUDA priority channels to enforce sub-millisecond compute preemption without kernel modification. Page-fault-free reclamation reserves a small memory pool per model and reclaims only at layer boundaries. Dynamic reservation sizes the pool at runtime based on model memory footprint. Correctness under concurrent kernels follows from hardware channel isolation; model heterogeneity is handled by per-model reservation. Section 6.3 presents an ablation on the production workload showing that removing any primitive violates the sub-millisecond latency or once-per-request rate bound. We have added one sentence to the abstract: 'These bounds are realized via channel-controlled compute isolation, page-fault-free reclamation, and dynamic reservation, requiring only one driver line and twenty framework lines.' revision: yes
-
Referee: [Abstract] Abstract: The interference figures (<5% TTFT, <2% TPOT) are given without reference to the specific online workloads, request rates, or how preemption events were triggered and measured during the production run, leaving the joint bounding claim unsupported by visible evidence.
Authors: We agree the abstract should reference the measurement conditions. The figures are from the same 7-day production trace with online request rates up to 1,200 req/s; preemption events are triggered by online arrivals and measured end-to-end via the serving framework's latency instrumentation. TTFT and TPOT are compared against the non-colocated baseline under identical load. We have revised the abstract to read: 'These gains incur <5% TTFT and <2% TPOT increase on production workloads with request rates up to 1,200 req/s, where preemption is triggered by online arrivals and quantified via end-to-end latency traces.' revision: yes
Circularity Check
No circularity; empirical production deployment with no mathematical derivation chain
full rationale
The paper presents a systems contribution for online-offline GPU colocation in LLM inference. Central claims rest on a production deployment across 8,054 GPUs that reports measured utilization gains and latency bounds, not on any equations, fitted parameters, or self-referential definitions. No derivation chain exists that reduces results to inputs by construction; the runtime primitives (channel-controlled isolation, page-fault-free reclamation, dynamic reservation) are described as engineering implementations validated by deployment measurements rather than derived from prior fitted quantities or self-citations. The minimal code-change assertions are presented as practical facts supported by the reported outcomes, with no load-bearing self-citation or renaming of known results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption GPU hardware and drivers can support channel-controlled compute isolation and page-fault-free memory reclamation with minimal modification
invented entities (1)
-
Valve GPU runtime
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Gulavani, and Ramachandran Ramjee
Agrawal, A., Panwar, A., Mohan, J., Kwatra, N., Gulavani, B. S., and Ramjee, R. Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills.arXiv preprint arXiv:2308.16369,
-
[2]
Duan, J., Lu, R., Duanmu, H., Li, X., Zhang, X., Lin, D., Stoica, I., and Zhang, H. Muxserve: Flexible spatial- temporal multiplexing for multiple llm serving.arXiv preprint arXiv:2404.02015,
-
[3]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: In- centivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
https://developer.nvidia.com/blog/ unified-memory-cuda-beginners/. Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. Livecodebench: Holistic and contamination free eval- uation of large language models for code.arXiv preprint arXiv:2403.07974,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. Swe-bench: Can language mod- els resolve real-world github issues?arXiv preprint arXiv:2310.06770,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
RLAIF : Scaling reinforcement learning from human feedback with ai feedback
Lee, H., Phatale, S., Mansoor, H., Mesnard, T., Ferret, J., Lu, K., Bishop, C., Hall, E., Carbune, V ., Rastogi, A., et al. Rlaif vs. rlhf: Scaling reinforcement learning from human feedback with ai feedback.arXiv preprint arXiv:2309.00267,
-
[7]
Towards swift serverless llm cold starts with paraserve
9 Valve: Production Online–Offline Inference Colocation with Jointly-Bounded Preemption Latency and Rate Lou, C., Qi, S., Jin, C., Nie, D., Yang, H., Liu, X., and Jin, X. Towards swift serverless llm cold starts with paraserve. arXiv preprint arXiv:2502.15524,
-
[8]
CUDA Multi-Process Service, 2026a
NVIDIA. CUDA Multi-Process Service, 2026a. https://docs.nvidia.com/deploy/mps/ index.html(accessed 2026-04-09). NVIDIA. ”nvidia multi-instance gpu (mig)”, 2026b. ”https://docs.nvidia.com/datacenter/ tesla/mig-user-guide/ (accessed 2026-04-09)”. OpenAI. Introducing ChatGPT, 2022.https://openai. com/blog/chatgpt. Patke, A., Reddy, D., Jha, S., Qiu, H., Pint...
-
[9]
Gon- zalez, Ion Stoica, and Harry Xu
Qiao, Y ., Anzai, S., Yu, S., Ma, H., Wang, Y ., Kim, M., and Xu, H. Conserve: Harvesting gpus for low-latency and high-throughput large language model serving.arXiv preprint arXiv:2410.01228,
-
[10]
Hybridflow: A flexible and efficient rlhf frame- work.EuroSys 2025 (30/03/2025-03/04/2025, Rotter- dam),
Wu, C. Hybridflow: A flexible and efficient rlhf frame- work.EuroSys 2025 (30/03/2025-03/04/2025, Rotter- dam),
2025
-
[11]
Xu, J., Zhang, R., Guo, C., Hu, W., Liu, Z., Wu, F., Feng, Y ., Sun, S., Shao, C., Guo, Y ., et al. vtensor: Flexible virtual tensor management for efficient llm serving.arXiv preprint arXiv:2407.15309,
-
[12]
Prism: Unleashing gpu sharing for cost-efficient multi-llm serving, 2025
Yu, S., Xing, J., Qiao, Y ., Ma, M., Li, Y ., Wang, Y ., Yang, S., Xie, Z., Cao, S., Bao, K., et al. Prism: Unleashing gpu sharing for cost-efficient multi-llm serving.arXiv preprint arXiv:2505.04021,
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.