arxiv: 2604.07874 · v1 · submitted 2026-04-09 · 💻 cs.OS

Recognition: no theorem link

Valve: Production Online-Offline Inference Colocation with Jointly-Bounded Preemption Latency and Rate

Chong Zha, Fangyue Liu, Hanmei Luo, Hao Liang, Hua Liu, Lingpeng Chen, Peng Chen, Shuo Ai, Xin Jin, Xinyuan Lyu, Ziqian Hu

Pith reviewed 2026-05-10 18:00 UTC · model grok-4.3

classification 💻 cs.OS

keywords LLM inferenceGPU colocationpreemption latencyonline-offline schedulingcluster utilizationproduction system

0 comments

The pith

Valve enables safe colocation of online and offline LLM inference on GPUs by jointly bounding preemption latency and rate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Valve as a system to address underutilization in LLM inference clusters caused by bursty traffic. By colocating online latency-critical services with offline jobs, it aims to improve resource efficiency while keeping interference low. The key is a set of guarantees on preemption: sub-millisecond compute preemption at most once per online request, combined with controlled memory reclamation. This is implemented with minimal changes to existing drivers and frameworks, and has been validated in a large-scale production environment showing substantial GPU savings.

Core claim

Valve is a production-friendly colocation system that jointly bounds preemption latency and preemption rate. Specifically, it enables sub-millisecond compute preemption at most once per online request, and rate-limited sub-layer memory reclamation. These guarantees are provided by a GPU runtime that combines channel-controlled compute isolation, page-fault-free memory reclamation, and dynamic memory reservation. It requires only one line of driver modification and 20 lines of framework patch.

What carries the argument

A GPU runtime combining channel-controlled compute isolation, page-fault-free memory reclamation, and dynamic memory reservation, which jointly bounds preemption latency to sub-millisecond and preemption rate to at most one per online request.

If this is right

Improves cluster utilization by 34.6%, equivalent to saving 2,170 GPUs in the deployed cluster.
Incurs less than 5% increase in TTFT and less than 2% in TPOT across workloads.
Enables deployment on 8,054 GPUs in production with minimal code changes.
Supports colocation of different models while maintaining preemption guarantees.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar bounding techniques could extend to other hardware accelerators if analogous runtime primitives exist.
Reducing preemption overhead this way may allow more aggressive colocation policies in future scheduling systems.
The minimal modification requirement suggests Valve could be adopted broadly without major framework overhauls.

Load-bearing premise

The GPU runtime primitives for isolation, reclamation, and reservation can be implemented with one line of driver change and twenty lines of framework patch while maintaining correctness for all production models and drivers.

What would settle it

A measurement in production showing either preemption latency greater than one millisecond or more than one preemption per online request would falsify the joint bounding claim.

Figures

Figures reproduced from arXiv: 2604.07874 by Chong Zha, Fangyue Liu, Hanmei Luo, Hao Liang, Hua Liu, Lingpeng Chen, Peng Chen, Shuo Ai, Xin Jin, Xinyuan Lyu, Ziqian Hu.

**Figure 1.** Figure 1: Comparison of online-offline colocation approaches. user-facing applications, inference has also become a building block for training workflows, including data curation, post-training actor rollouts, and critic scoring (Guo et al., 2025; Lee et al., 2023). Despite their importance, LLM inference clusters in production still suffer from low utilization. This is primarily because operators must provision f… view at source ↗

**Figure 2.** Figure 2: Coefficient of Variation (CV) analysis of GPU utilization and KV cache utilization. 2. Background We first analyze why GPU utilization is low in production by characterizing workload burstiness and SLA requirements. We then derive key requirements for production-friendly online–offline inference colocation, review existing systems and their limitations, and outline the main challenges. 2.1. Low GPU Utiliza… view at source ↗

**Figure 4.** Figure 4: Distribution of gap interval between decode iterations. offline throughput. Gpreempt (Fan et al., 2025) uses a CUDA-driver timeslice for automatic switching, but the decode phase has short gaps between iterations ( [PITH_FULL_IMAGE:figures/full_fig_p003_4.png] view at source ↗

**Figure 6.** Figure 6: Channel control in the kernel launch path. with fast sub-layer memory reclamation coordinated with compute preemption, and controls reclamation frequency via MIAD(Multiple Increase, Addictive Decrease)-style online reservation (§5). These mechanisms also preserve high offline throughput by harvesting most idle compute cycles; during memory reclamation, selective eviction affects fewer offline requests and… view at source ↗

**Figure 7.** Figure 7: Sub-layer memory reclamation with dynamic online memory reservation. ling offline compute. Channel control for portable compute isolation. Valve uses this ioctl-level control point to preempt and resume offline workloads with low latency. Specifically, it disables the offline workload’s channel to preempt execution and later reenables it to resume. On Pascal+ GPUs, disabling a channel triggers a hardware… view at source ↗

**Figure 8.** Figure 8: Cluster GPU utilization with Valve. Nov 01 Dec 01 Jan 01 (b) Deployment scale growth 0 1800 3600 5400 7200 9000 Deployment Scale Date [PITH_FULL_IMAGE:figures/full_fig_p007_8.png] view at source ↗

**Figure 9.** Figure 9: Deployment scale with Valve. 7.1. Production Impact of Valve Deployment. Valve is deployed in production clusters with 8045 GPU cards. The clusters serve both online and offline inference workloads. Valve has been in production for more than three months, with the number of managed GPU cards increasing over time, illustrated by [PITH_FULL_IMAGE:figures/full_fig_p007_9.png] view at source ↗

**Figure 11.** Figure 11: Effectiveness of the eviction selection. 8. Related Work Autoscaling. Serverless autoscaling systems (e.g., BlitzScale, ServerlessLLM, and HydraServe) unload model weights when idle and reload them on demand (Zhang et al., 2025; Lou et al., 2025; Fu et al., 2024). This reduces steadystate GPU memory footprint, but the on-demand reload can introduce cold-start delays (e.g., TTFT spikes) under highly burs… view at source ↗

read the original abstract

LLM inference powers latency-critical production services nowadays. The bursty nature of inference traffic results in over-provisioning, which in turn leads to resource underutilization. While online-offline colocation promises to utilize idle capacity, broad production deployment must overcome two major challenges: (i) large online interference due to slow or frequent preemptions, and (ii) extensive frameworks and drivers modifications, to colocate different models and support preemptions. We present Valve, a production-friendly colocation system that jointly bounds preemption latency and preemption rate. Specifically, Valve enables sub-millisecond compute preemption at most once per online request, and rate-limited sub-layer memory reclamation. These guaranties are provided by a GPU runtime that combines channel-controlled compute isolation, page-fault-free memory reclamation, and dynamic memory reservation. Critically, Valve is practical to deploy, requiring one line of driver modification and 20 lines of framework patch. Deployed on 8,054 GPUs in production, Valve improves cluster utilization by 34.6%, which translates to a 2,170 GPU save. This efficiency gains is achieved with minimal online interference, incurring <5% TTFT increase and <2% TPOT increase across workloads.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Valve gets real production gains from online-offline GPU colocation with almost no code changes, but needs more detail on how the minimal patches enforce the bounds.

read the letter

Valve shows how to colocate online LLM inference with offline jobs on GPUs while bounding preemption to sub-millisecond latency and at most once per request. The system uses a GPU runtime with channel-controlled isolation, page-fault-free reclamation, and dynamic reservation, all achieved with just one driver line and twenty framework lines. On 8,054 GPUs in production it lifted utilization by 34.6 percent with under 5 percent TTFT hit and under 2 percent TPOT hit.

Referee Report

3 major / 2 minor

Summary. The paper presents Valve, a production system for online-offline colocation of LLM inference on GPUs. It claims to jointly bound preemption latency to sub-millisecond levels and preemption rate to at most once per online request via a GPU runtime using channel-controlled compute isolation, page-fault-free memory reclamation, and dynamic reservation. These guarantees are realized with only one line of driver modification and twenty lines of framework patch. In a production deployment across 8,054 GPUs, Valve reports a 34.6% cluster utilization improvement (saving 2,170 GPUs) while incurring <5% TTFT increase and <2% TPOT increase.

Significance. If the empirical results and minimal-modification claims hold under scrutiny, the work would offer a practically significant contribution to GPU cluster management for latency-sensitive LLM serving. The production-scale deployment and quantified savings address real underutilization from bursty traffic while preserving service-level objectives, potentially influencing how inference workloads are scheduled in large fleets.

major comments (3)

[Abstract] Abstract: The headline result of 34.6% utilization improvement and 2,170 GPU savings on 8,054 GPUs is presented without any description of the baseline configuration, workload mix, or measurement methodology (e.g., how utilization was sampled, what the offline workload consisted of, or how interference was quantified under load). This absence makes the central empirical claim impossible to evaluate from the given text.
[Abstract] Abstract: The assertion that channel-controlled isolation, page-fault-free reclamation, and dynamic reservation jointly bound preemption latency and rate while requiring only one driver line and twenty framework lines is stated at a high level. No concrete description of those changes, no argument that they preserve correctness under concurrent kernels or model heterogeneity, and no ablation showing the latency/rate bounds hold on the deployed workloads is supplied.
[Abstract] Abstract: The interference figures (<5% TTFT, <2% TPOT) are given without reference to the specific online workloads, request rates, or how preemption events were triggered and measured during the production run, leaving the joint bounding claim unsupported by visible evidence.

minor comments (2)

[Abstract] Abstract contains grammatical errors: 'guaranties' should be 'guarantees' and 'This efficiency gains is' should be 'This efficiency gain is'.
[Abstract] The abstract would benefit from a brief sentence on the target GPU driver and framework versions to contextualize the minimal-patch claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We agree that the abstract should be more self-contained to allow evaluation of the central claims without requiring the full text. We have revised the abstract to incorporate brief descriptions of the baseline, workload, measurement approach, mechanisms, and interference evaluation while preserving conciseness. The full manuscript (Sections 3–6) supplies the detailed arguments, ablations, and production traces referenced below.

read point-by-point responses

Referee: [Abstract] Abstract: The headline result of 34.6% utilization improvement and 2,170 GPU savings on 8,054 GPUs is presented without any description of the baseline configuration, workload mix, or measurement methodology (e.g., how utilization was sampled, what the offline workload consisted of, or how interference was quantified under load). This absence makes the central empirical claim impossible to evaluate from the given text.

Authors: We agree the abstract requires additional context. The baseline is a non-colocated deployment in which offline jobs execute only on GPUs that are idle after satisfying all online SLOs, with no preemption. The workload is production online inference traffic (bursty, 20–40% average GPU utilization) colocated with offline batch jobs using the same model families. Utilization was measured via DCGM counters sampled every 10 s over a 7-day production trace; the 34.6% gain is the reduction in total GPUs needed to meet the same TTFT/TPOT SLOs. We have updated the abstract to state: 'Compared with a non-colocated baseline that meets identical SLOs, Valve improves cluster utilization by 34.6% (saving 2,170 GPUs) on an 8,054-GPU fleet.' revision: yes
Referee: [Abstract] Abstract: The assertion that channel-controlled isolation, page-fault-free reclamation, and dynamic reservation jointly bound preemption latency and rate while requiring only one driver line and twenty framework lines is stated at a high level. No concrete description of those changes, no argument that they preserve correctness under concurrent kernels or model heterogeneity, and no ablation showing the latency/rate bounds hold on the deployed workloads is supplied.

Authors: The abstract summarizes the three primitives; Sections 3 and 4 of the manuscript describe them in detail. Channel-controlled isolation uses CUDA priority channels to enforce sub-millisecond compute preemption without kernel modification. Page-fault-free reclamation reserves a small memory pool per model and reclaims only at layer boundaries. Dynamic reservation sizes the pool at runtime based on model memory footprint. Correctness under concurrent kernels follows from hardware channel isolation; model heterogeneity is handled by per-model reservation. Section 6.3 presents an ablation on the production workload showing that removing any primitive violates the sub-millisecond latency or once-per-request rate bound. We have added one sentence to the abstract: 'These bounds are realized via channel-controlled compute isolation, page-fault-free reclamation, and dynamic reservation, requiring only one driver line and twenty framework lines.' revision: yes
Referee: [Abstract] Abstract: The interference figures (<5% TTFT, <2% TPOT) are given without reference to the specific online workloads, request rates, or how preemption events were triggered and measured during the production run, leaving the joint bounding claim unsupported by visible evidence.

Authors: We agree the abstract should reference the measurement conditions. The figures are from the same 7-day production trace with online request rates up to 1,200 req/s; preemption events are triggered by online arrivals and measured end-to-end via the serving framework's latency instrumentation. TTFT and TPOT are compared against the non-colocated baseline under identical load. We have revised the abstract to read: 'These gains incur <5% TTFT and <2% TPOT increase on production workloads with request rates up to 1,200 req/s, where preemption is triggered by online arrivals and quantified via end-to-end latency traces.' revision: yes

Circularity Check

0 steps flagged

No circularity; empirical production deployment with no mathematical derivation chain

full rationale

The paper presents a systems contribution for online-offline GPU colocation in LLM inference. Central claims rest on a production deployment across 8,054 GPUs that reports measured utilization gains and latency bounds, not on any equations, fitted parameters, or self-referential definitions. No derivation chain exists that reduces results to inputs by construction; the runtime primitives (channel-controlled isolation, page-fault-free reclamation, dynamic reservation) are described as engineering implementations validated by deployment measurements rather than derived from prior fitted quantities or self-citations. The minimal code-change assertions are presented as practical facts supported by the reported outcomes, with no load-bearing self-citation or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The design rests on domain assumptions about GPU hardware behavior and introduces a new runtime implementation; no free parameters are fitted to data in the abstract.

axioms (1)

domain assumption GPU hardware and drivers can support channel-controlled compute isolation and page-fault-free memory reclamation with minimal modification
Invoked to achieve sub-millisecond preemption guarantees

invented entities (1)

Valve GPU runtime no independent evidence
purpose: Enforce jointly bounded preemption latency and rate while requiring only minimal driver and framework changes
New system component introduced to realize the colocation guarantees

pith-pipeline@v0.9.0 · 5553 in / 1352 out tokens · 53739 ms · 2026-05-10T18:00:12.987188+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 11 canonical work pages · 3 internal anchors

[1]

Gulavani, and Ramachandran Ramjee

Agrawal, A., Panwar, A., Mohan, J., Kwatra, N., Gulavani, B. S., and Ramjee, R. Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills.arXiv preprint arXiv:2308.16369,

work page arXiv
[2]

Muxserve: flexible spatial-temporal multi- plexing for multiple llm serving.arXiv preprint arXiv:2404.02015, 2024

Duan, J., Lu, R., Duanmu, H., Li, X., Zhang, X., Lin, D., Stoica, I., and Zhang, H. Muxserve: Flexible spatial- temporal multiplexing for multiple llm serving.arXiv preprint arXiv:2404.02015,

work page arXiv
[3]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: In- centivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

https://developer.nvidia.com/blog/ unified-memory-cuda-beginners/. Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. Livecodebench: Holistic and contamination free eval- uation of large language models for code.arXiv preprint arXiv:2403.07974,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. Swe-bench: Can language mod- els resolve real-world github issues?arXiv preprint arXiv:2310.06770,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

RLAIF : Scaling reinforcement learning from human feedback with ai feedback

Lee, H., Phatale, S., Mansoor, H., Mesnard, T., Ferret, J., Lu, K., Bishop, C., Hall, E., Carbune, V ., Rastogi, A., et al. Rlaif vs. rlhf: Scaling reinforcement learning from human feedback with ai feedback.arXiv preprint arXiv:2309.00267,

work page arXiv
[7]

Towards swift serverless llm cold starts with paraserve

9 Valve: Production Online–Offline Inference Colocation with Jointly-Bounded Preemption Latency and Rate Lou, C., Qi, S., Jin, C., Nie, D., Yang, H., Liu, X., and Jin, X. Towards swift serverless llm cold starts with paraserve. arXiv preprint arXiv:2502.15524,

work page arXiv
[8]

CUDA Multi-Process Service, 2026a

NVIDIA. CUDA Multi-Process Service, 2026a. https://docs.nvidia.com/deploy/mps/ index.html(accessed 2026-04-09). NVIDIA. ”nvidia multi-instance gpu (mig)”, 2026b. ”https://docs.nvidia.com/datacenter/ tesla/mig-user-guide/ (accessed 2026-04-09)”. OpenAI. Introducing ChatGPT, 2022.https://openai. com/blog/chatgpt. Patke, A., Reddy, D., Jha, S., Qiu, H., Pint...

work page arXiv 2026
[9]

Gon- zalez, Ion Stoica, and Harry Xu

Qiao, Y ., Anzai, S., Yu, S., Ma, H., Wang, Y ., Kim, M., and Xu, H. Conserve: Harvesting gpus for low-latency and high-throughput large language model serving.arXiv preprint arXiv:2410.01228,

work page arXiv
[10]

Hybridflow: A flexible and efficient rlhf frame- work.EuroSys 2025 (30/03/2025-03/04/2025, Rotter- dam),

Wu, C. Hybridflow: A flexible and efficient rlhf frame- work.EuroSys 2025 (30/03/2025-03/04/2025, Rotter- dam),

2025
[11]

vtensor: Flexible virtual tensor management for efficient llm serving.arXiv preprint arXiv:2407.15309,

Xu, J., Zhang, R., Guo, C., Hu, W., Liu, Z., Wu, F., Feng, Y ., Sun, S., Shao, C., Guo, Y ., et al. vtensor: Flexible virtual tensor management for efficient llm serving.arXiv preprint arXiv:2407.15309,

work page arXiv
[12]

Prism: Unleashing gpu sharing for cost-eﬀicient multi-llm serving, 2025

Yu, S., Xing, J., Qiao, Y ., Ma, M., Li, Y ., Wang, Y ., Yang, S., Xie, Z., Cao, S., Bao, K., et al. Prism: Unleashing gpu sharing for cost-efficient multi-llm serving.arXiv preprint arXiv:2505.04021,

work page arXiv