pith. machine review for the scientific record.

cs.OS

Operating Systems

Roughly includes material in ACM Subject Classes D.4.1, D.4.2, D.4.3, D.4.4, D.4.5, D.4.7, and D.4.9.

cs.OS 2026-05-06

Pub/sub smart pointer limits global reference updates to 0<->1 transitions per subscriber

ipc_shared_ptr: A Publish/Subscribe-Aware Smart Pointer for Cross-Process Object Lifetime Management

Specializing Birrell's reference listing cuts global communication by an order of magnitude for zero-copy IPC at Autoware scale.

Abstract
True zero-copy Inter-Process Communication (IPC) in publish/subscribe (pub/sub) middleware such as Robot Operating System 2 (ROS 2) requires subscribers to reference message objects in publisher-owned shared memory. Objects must not be reclaimed while referenced, yet must eventually be reclaimed, with correct handling of crash recovery and Transient Local QoS retention requirements. We propose ipc_shared_ptr, a pub/sub-aware smart pointer for cross-process message lifetime management. ipc_shared_ptr exploits pub/sub structural properties to specialize Birrell's reference listing, limiting global metadata updates to per-subscriber 0<->1 transitions and achieving an order-of-magnitude reduction in global communication over general-purpose distributed reference counting. We analyze the key metadata management tradeoff: scalability versus implementation simplicity. Owner-driven reclaim offers greater scalability, but concurrent membership changes and reclamation decisions produce races that widen the correctness-verification state space. Single-writer achieves structural atomicity, eliminating this complexity at the cost of a centralized bottleneck. iceoryx2 (owner-driven reclaim) and Agnocast -- a true zero-copy ROS 2 IPC middleware sharing the publisher's heap with subscribers and adopting ipc_shared_ptr with single-writer -- embody each architecture. Comparative evaluation at the scale of Autoware -- the largest open-source ROS 2 application -- confirms that single-writer achieves sufficient scalability: at 200 topics, two subscribers per topic and 100 Hz, Agnocast's E2E p99.9 is 2.9x lower than iceoryx2's, justifying implementation simplicity over owner-driven reclaim.
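
As a rough illustration of the 0<->1 idea (not the paper's actual API; the class, table layout, and member names below are invented), a subscriber-side smart pointer can keep a process-local count and touch the cross-process reference table only when that count transitions between zero and one:

#include <atomic>
#include <cstdint>

// Hypothetical cross-process table in shared memory: one presence bit per subscriber.
struct GlobalRefTable {
    std::atomic<uint64_t> subscriber_bits{0};                        // bit i set => subscriber i holds >= 1 reference
    void announce(int sub) { subscriber_bits.fetch_or(1ull << sub); }
    void retract(int sub)  { subscriber_bits.fetch_and(~(1ull << sub)); }
};

// Sketch of a pub/sub-aware smart pointer: copies inside one subscriber process only
// bump a local counter; shared-memory metadata is written only on 0->1 and 1->0.
// Crash recovery and Transient Local retention, which the paper also handles, are elided.
class IpcSharedPtrSketch {
public:
    IpcSharedPtrSketch(const void* msg, GlobalRefTable* table, int sub_id,
                       std::atomic<int>* local_count)
        : msg_(msg), table_(table), sub_id_(sub_id), local_(local_count) {
        if (local_->fetch_add(1) == 0)   // 0 -> 1: this subscriber becomes visible globally
            table_->announce(sub_id_);
    }
    IpcSharedPtrSketch(const IpcSharedPtrSketch& o)
        : IpcSharedPtrSketch(o.msg_, o.table_, o.sub_id_, o.local_) {}
    ~IpcSharedPtrSketch() {
        if (local_->fetch_sub(1) == 1)   // 1 -> 0: the subscriber's last reference is gone
            table_->retract(sub_id_);
    }
    const void* get() const { return msg_; }
private:
    const void* msg_;
    GlobalRefTable* table_;
    int sub_id_;
    std::atomic<int>* local_;
};
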
cs.OS 2026-05-06

GPU-centric store makes SSD KV cache match DRAM speed

Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving

Tutti removes CPU bottlenecks to cut TTFT by 78 percent, double request rates, and lower serving cost by 27 percent for long-context LLMs.

Abstract
LLM serving relies on prefix caching to improve inference performance. As growing contexts push key-value (KV) cache footprint far beyond GPU HBM and CPU DRAM capacity, KV cache is increasingly offloaded to NVMe SSDs. Unfortunately, restoring KV cache from SSDs suffers from poor I/O performance and incurs significant GPU stalls. This is primarily because the fragmented GPU memory layout results in a massive number of tiny random I/Os, rendering the low-parallelism CPU a severe bottleneck even with GPU Direct Storage (GDS), which still relies on CPU intervention to initiate each I/O and thus remains CPU-centric. This paper presents Tutti, an efficient SSD-backed KV caching solution that eliminates CPU intervention from the critical data and I/O control paths between HBM and SSDs. At the core of Tutti is a GPU-centric KV cache object store, in which the CPU is only responsible for asynchronously loading I/O kernels once per layer to the GPU. Tutti saturates NVMe SSD bandwidth and reduces GPU stalls to near zero through the following designs: (i) we provide a GPU-native object abstraction that enables bulk KV cache transfers and management; (ii) we re-architect the GPU storage stack by introducing GPU io_uring to support asynchronous GPU direct object I/O; and (iii) we propose slack-aware I/O scheduling to avoid GPU resource contention. We have implemented Tutti and integrated it into vLLM. Extensive evaluation shows that compared to the state-of-the-art GDS-enabled, SSD-backed LMCache, Tutti reduces TTFT by 78.3% under strict SLO constraints and improves the achievable request rate by 2x. The serving cost is reduced by 27%. Tutti achieves nearly the same inference performance as DRAM-backed LMCache, while providing almost infinite capacity.
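
One way to picture slack-aware I/O scheduling (a minimal sketch under an assumed timing model and invented names, not Tutti's implementation): a layer's KV-restore I/O is dispatched when its deadline is close, and deferred while there is ample slack and the GPU is busy with compute, so restores overlap compute instead of contending with it.

#include <cstdio>
#include <vector>

// One pending KV-cache restore per transformer layer (all numbers are illustrative).
struct Restore { int layer; double io_ms; };

// Slack = time until the layer's prefill needs its KV blocks, minus estimated I/O time.
double slack_ms(const Restore& r, int current_layer, double per_layer_compute_ms) {
    return (r.layer - current_layer) * per_layer_compute_ms - r.io_ms;
}

int main() {
    const double per_layer_compute_ms = 4.0;   // assumed prefill cost per layer
    const double urgency_ms = 8.0;             // dispatch unconditionally below this slack
    const bool gpu_has_idle_sms = false;       // pretend compute currently saturates the GPU
    int current_layer = 10;
    std::vector<Restore> pending{{12, 3.0}, {14, 6.0}, {30, 6.0}};

    for (const Restore& r : pending) {
        double s = slack_ms(r, current_layer, per_layer_compute_ms);
        if (s <= urgency_ms)
            std::printf("layer %d: dispatch now (slack %.1f ms, deadline close)\n", r.layer, s);
        else if (gpu_has_idle_sms)
            std::printf("layer %d: dispatch opportunistically (slack %.1f ms)\n", r.layer, s);
        else
            std::printf("layer %d: defer (slack %.1f ms, avoid contention)\n", r.layer, s);
    }
}
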
cs.OS 2026-05-05

Three-tier API governs urban sensor data with escalating privacy constraints

CityOS: Privacy Architecture for Urban Sensing

Untrusted apps access data from local raw feeds to citywide aggregates while user devices enforce per-user privacy budgets.

Abstract
Cities are rapidly deploying sensing infrastructure -- cameras, environmental sensors, and connected kiosks -- that continuously observe public spaces, yet they lack a system architecture governing how applications access, aggregate, and retain this data, creating privacy risks and preventing consistent policy enforcement. We present CityOS, an operating system for urban sensing that mediates application access to sensor data through a three-tier API inspired by structured, privacy-conscious web interfaces. The tiers expand the spatial scope of data access while imposing progressively stronger privacy constraints: On-Scene supports real-time sensing with raw data confined to the local context; Single-Locality Aggregation enables differentially private longitudinal statistics at a fixed location; and Cross-Locality Aggregation supports citywide analytics via aggregation across locations, with user devices enforcing per-user privacy budgets. CityOS runs as an edge runtime that executes untrusted applications in ephemeral containers, enforcing these policies and providing transparency via broadcasts of differential privacy loss. We implement CityOS and applications across all tiers -- including pedestrian safety alerts, real-time and forecast parking availability, traffic dashboards, and subway trajectory measurement -- and show that it supports practical streetscape applications while enforcing strong privacy.
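
A minimal sketch of the per-user budget bookkeeping the Cross-Locality tier depends on (names, the Laplace accounting, and the budget value are assumptions, not the CityOS implementation): the user's device answers an aggregation query only if the remaining budget covers the query's epsilon, and every answer is charged against that budget.

#include <cmath>
#include <cstdio>
#include <optional>
#include <random>

// Per-user differential-privacy budget held on the user's own device.
class PrivacyBudget {
public:
    explicit PrivacyBudget(double total_epsilon) : remaining_(total_epsilon) {}

    // Answer a counting query with Laplace noise iff enough budget remains;
    // otherwise refuse, so the total disclosed epsilon never exceeds the budget.
    std::optional<double> answer(double true_value, double epsilon, double sensitivity = 1.0) {
        if (epsilon <= 0.0 || epsilon > remaining_) return std::nullopt;
        remaining_ -= epsilon;
        return true_value + laplace(sensitivity / epsilon);
    }
    double remaining() const { return remaining_; }

private:
    double laplace(double scale) {                       // inverse-CDF Laplace sampling
        std::uniform_real_distribution<double> u(-0.5, 0.5);
        double x = u(rng_);
        return -scale * std::copysign(1.0, x) * std::log(1.0 - 2.0 * std::fabs(x));
    }
    double remaining_;
    std::mt19937 rng_{std::random_device{}()};
};

int main() {
    PrivacyBudget budget(1.0);                           // assumed per-user budget: epsilon = 1.0
    for (int query = 0; query < 4; ++query) {
        auto noisy = budget.answer(/*true_value=*/1.0, /*epsilon=*/0.3);
        if (noisy) std::printf("query %d answered: %.2f (budget left %.1f)\n",
                               query, *noisy, budget.remaining());
        else       std::printf("query %d refused: budget exhausted\n", query);
    }
}
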
cs.OS 2026-05-04

VUDA delivers 85% higher throughput via CUDA-Vulkan spatial sharing

VUDA: Breaking CUDA-Vulkan Isolation for Spatial Sharing of Compute and Graphics on the Same GPU

Channel redirection and page-table grafting let compute and graphics run concurrently on one GPU without data copies or time slicing.

Abstract
GPU-based simulation environments for embodied AI interleave physics simulation (CUDA) and photorealistic rendering (Vulkan) on a single device. We observe that two foundational scenarios -- simulation data generation and RL training -- can be naturally adapted to execute their simulation and rendering phases concurrently, presenting a significant opportunity to improve GPU utilization through spatial multiplexing. However, a fundamental obstacle we term execution isolation prevents this: CUDA and Vulkan create separate GPU contexts whose channels are bound to different scheduling groups, confining compute and graphics to mutually exclusive time slices. Existing spatial-sharing techniques are limited to the CUDA ecosystem, while temporal-sharing approaches underutilize available resources. This paper presents VUDA, a system that breaks execution isolation to enable spatial parallelism between CUDA compute and Vulkan graphics workloads. VUDA is built on two key observations: although CUDA and Vulkan expose different programming abstractions, their execution paths converge to a common channel primitive at the driver and hardware level; meanwhile, their virtual-address spaces are inherently disjoint, making safe page-table merging feasible without remapping. VUDA exposes a thin API for developers to annotate co-schedulable CUDA streams, and realizes spatial sharing through channel redirection into Vulkan's scheduling domain and page-table grafting to unify address spaces, eliminating all data copying on the critical path. Experiments on representative embodied-AI workloads show that VUDA delivers up to 85% higher throughput than temporal-sharing baselines, while improving GPU utilization and reducing end-to-end latency.
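
The safety precondition for page-table grafting can be illustrated with a small check (a sketch of the idea only; real contexts expose many mappings and the merge itself happens in the driver): grafting one context's page tables into another without remapping is only safe if the two virtual-address sets never overlap.

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// One virtual-address mapping owned by a GPU context (illustrative values).
struct VaRange { uint64_t start, end; };   // half-open [start, end)

// No CUDA mapping may overlap a Vulkan mapping; an overlap would alias pages after grafting.
bool disjoint(std::vector<VaRange> a, std::vector<VaRange> b) {
    auto by_start = [](const VaRange& x, const VaRange& y) { return x.start < y.start; };
    std::sort(a.begin(), a.end(), by_start);
    std::sort(b.begin(), b.end(), by_start);
    size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i].end <= b[j].start)      ++i;   // a[i] lies entirely below b[j]
        else if (b[j].end <= a[i].start) ++j;   // b[j] lies entirely below a[i]
        else return false;                      // overlap: grafting would alias pages
    }
    return true;
}

int main() {
    std::vector<VaRange> cuda_vas{{0x7f0000000000, 0x7f0000100000}};
    std::vector<VaRange> vulkan_vas{{0x7f8000000000, 0x7f8000200000}};
    std::printf("page-table grafting %s\n",
                disjoint(cuda_vas, vulkan_vas) ? "is safe (disjoint VA spaces)"
                                               : "needs remapping (VA overlap)");
}
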
cs.OS 2026-05-01

Agent sandboxes hit 100% recovery correctness with up to 87% less checkpoint traffic

Crab: A Semantics-Aware Checkpoint/Restore Runtime for Agent Sandboxes

By inspecting OS effects per turn with eBPF, Crab skips most checkpoints and overlaps the rest with LLM wait time.

Abstract
Autonomous agents act through sandboxed containers and microVMs whose state spans filesystems, processes, and runtime artifacts. Checkpoint and restore (C/R) of this state is needed for fault tolerance, spot execution, RL rollout branching, and safe rollback -- yet existing approaches fall into two extremes: application-level recovery preserves chat history but misses OS-side effects, while full per-turn checkpointing is correct but too expensive under dense co-location. The root cause is an agent-OS semantic gap: agent frameworks see tool calls but not their OS effects; the OS sees state changes but lacks turn-level context to judge recovery relevance. This gap hides massive sparsity: over 75% of agent turns produce no recovery-relevant state, so most checkpoints are unnecessary. Crab (Checkpoint-and-Restore for Agent SandBoxes) is a transparent host-side runtime that bridges this gap without modifying agents or C/R backends. An eBPF-based inspector classifies each turn's OS-visible effects to decide checkpoint granularity; a coordinator aligns checkpoints with turn boundaries and overlaps C/R with LLM wait time; and a host-scoped engine schedules checkpoint traffic across co-located sandboxes. On shell-intensive and code-repair workloads, Crab raises recovery correctness from 8% (chat-only) to 100%, cuts checkpoint traffic by up to 87%, and stays within 1.9% of fault-free execution time.
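
A sketch of the per-turn decision that the inspector's classification feeds (the effect categories below are hypothetical; the real system derives them from eBPF-observed syscalls): most turns touch nothing recovery-relevant and are skipped, scratch-only writes get a cheap record, and only durable filesystem or process changes trigger a full checkpoint.

#include <cstdio>

// OS-visible effects of one agent turn, as a hypothetical summary of eBPF observations.
struct TurnEffects {
    bool wrote_tracked_files = false;    // durable filesystem mutations
    bool spawned_processes  = false;     // long-lived processes or daemons started
    bool touched_tmp_only   = false;     // writes confined to scratch space
    bool read_only          = false;     // no writes at all
};

enum class Checkpoint { Skip, MetadataOnly, Full };

// Decide checkpoint granularity for this turn; checkpoints align with turn
// boundaries and are overlapped with the LLM call for the next turn.
Checkpoint decide(const TurnEffects& e) {
    if (e.wrote_tracked_files || e.spawned_processes) return Checkpoint::Full;
    if (e.touched_tmp_only)                           return Checkpoint::MetadataOnly;
    return Checkpoint::Skip;                          // the common (>75% of turns) case
}

int main() {
    TurnEffects grep_turn;   grep_turn.read_only = true;
    TurnEffects patch_turn;  patch_turn.wrote_tracked_files = true;
    std::printf("grep turn  -> %d (0=skip, 1=metadata, 2=full)\n", (int)decide(grep_turn));
    std::printf("patch turn -> %d\n", (int)decide(patch_turn));
}
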
cs.OS 2026-05-01

Affinity hints give 12% throughput boost on chiplet servers

Affinity Tailor: Dynamic Locality-Aware Scheduling at Scale

Userspace estimates demand and picks compact CPU groups as soft preferences, beating Linux CFS while keeping utilization high.

Abstract
Modern large multicore systems often run multiple workloads that share CPUs under schedulers such as Linux CFS. To keep CPUs busy, these schedulers load-balance runnable work, causing each workload to execute on many cores. This weakens locality at the microarchitectural level: workloads lose reuse in caches, branch predictors, and prefetchers, and interfere more with one another -- especially on chiplet-based systems, where spreading execution across cores also spreads it across LLC boundaries. A natural alternative is strict CPU partitioning, but hard partitions leave capacity idle when workloads do not fully use their reserved CPUs. We present Affinity Tailor, a userspace-guided kernel scheduling system built on a key insight: the kernel can preserve locality for workloads that share CPUs by treating demand-sized, topologically compact CPU sets as affinity hints rather than hard partitions. A userspace controller estimates each workload's CPU demand online and assigns a preferred CPU set sized to that demand, chosen to be as disjoint as possible from other workloads while spanning as few LLC domains as possible. The kernel then uses this set as an affinity hint, steering threads toward those CPUs while still allowing execution elsewhere when needed to preserve utilization. Deployed at Google, Affinity Tailor delivers geometric-mean per-CPU throughput gains of 12% on chiplet-based systems and 3% on non-chiplet systems over Linux CFS. Furthermore, faster execution reduces memory residency, yielding per-GB throughput gains of 3-7%. Our findings suggest that future schedulers should treat spatial locality as a first-class objective, even at the expense of work-conservation.
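
A toy version of the CPU-set selection (hypothetical topology and demand; the production controller measures demand online and also steers away from other workloads' sets, which is elided here): size the preferred set to the workload's demand and fill it LLC domain by LLC domain, so it spans as few LLC boundaries as possible, then hand it to the kernel as a soft hint rather than a hard partition.

#include <cstdio>
#include <vector>

// Assumed chiplet topology: each LLC domain (chiplet) owns a contiguous CPU range.
struct LlcDomain { int first_cpu; int cpus; };

// Pick `demand` CPUs as a soft affinity hint: take whole LLC domains first so the set
// spans as few LLC boundaries as possible.
std::vector<int> pick_affinity_hint(const std::vector<LlcDomain>& topo, int demand) {
    std::vector<int> cpus;
    for (const LlcDomain& d : topo) {
        for (int c = 0; c < d.cpus && (int)cpus.size() < demand; ++c)
            cpus.push_back(d.first_cpu + c);
        if ((int)cpus.size() >= demand) break;   // stop before touching another LLC domain
    }
    return cpus;
}

int main() {
    std::vector<LlcDomain> topo{{0, 8}, {8, 8}, {16, 8}, {24, 8}};   // 4 chiplets x 8 CPUs
    std::vector<int> hint = pick_affinity_hint(topo, 10);            // workload needs ~10 CPUs
    std::printf("preferred CPUs:");
    for (int c : hint) std::printf(" %d", c);
    std::printf("  (soft hint: threads may still run elsewhere)\n");
}
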
cs.OS 2026-05-01

WebAssembly capsules run updatable code on tiny microcontrollers

treVM: Tiny Rust Embedded Virtual Machines with WASM on Variable Resource-Constrained Hardware

A Rust scheme hosts customizable logic on Arm, RISC-V and Xtensa hardware with secure network updates.

Abstract
Software stacks embedded on microcontroller-based hardware typically provide rudimentary APIs programmed in C/C++, basic connectivity and, sometimes, a firmware update mechanism. Such coarse mechanisms contrast with widely used APIs and more advanced networked interaction expected from software stacks deployed on less resource-constrained hardware (microprocessor-based). In this paper, we aim to bridge this gap by designing treVM, a generic scheme to host high-level WebAssembly code capsules, bolted on a general-purpose Rust embedded software platform, able to run on a large variety of 32-bit microcontrollers. Not only can treVM capsules host highly customizable business logic, but capsules can also be securely updated on demand over the network, on devices already deployed in the field. We implement treVM in Rust, on top of Ariel OS, a general-purpose RTOS, and we publish the code as open source. Based on our implementation, we validate the feasibility of treVM on commonly available boards, and we report on extensive benchmarks we performed on heterogeneous hardware including Arm Cortex-M, RISC-V, and Xtensa microcontroller architectures. As such, treVM provides a promising new framework to secure continuous deployment of embedded software on low-power networked devices.
cs.OS 2026-04-29

Rust matches C on microcontroller firmware size and speed

Embedded Rust or C Firmware? Lessons from an Industrial Microcontroller Use Case with Ariel OS

Industrial case study finds no performance edge for C, with Rust runtime smaller than traditional bare-metal stacks.

Abstract
As Rust gains traction for developing safer systems software, a reality check for the microcontroller hardware segment becomes necessary. How ready is the Rust ecosystem for this segment? Can Rust compete with C in practice? This paper reports on an IoT industrial case study that contributes to answering these questions. Two teams concurrently developing the same functionality (one in C, one in Rust) are analyzed over a period of several months. A comparative analysis of their approaches, results, and iterative efforts is provided. The analysis and measurements on hardware indicate no strong reason to prefer C over Rust for microcontroller firmware on the basis of memory footprint or execution speed. Furthermore, Ariel OS is shown to provide an efficient and portable system runtime in Rust whose footprint is smaller than that of the state-of-the-art bare-metal C stack traditionally used in this context. It is concluded that Rust is a sound choice today for firmware development in this domain.
cs.OS 2026-04-21

Processes and pipes made lightweight for far memory accelerators

Proxics: an efficient programming model for far memory accelerators

Compilation and interconnect protocols let NDP devices run familiar OS abstractions, cutting bandwidth for databases and graphs while making low-latency CPU-accelerator channels essential.

Abstract
The use of disaggregated or far memory systems such as CXL memory pools has renewed interest in Near-Data Processing (NDP): situating cores close to memory to reduce bandwidth requirements to and from the CPU. Hardware designs for such accelerators are appearing, but clean, portable OS abstractions for programming them are lacking. We propose a programming model for NDP devices based on familiar OS abstractions: virtual processors (processes) and inter-process communication channels (like Unix pipes). While appealing from a user perspective, a naive implementation of such abstractions is inappropriate for NDP accelerators: the paucity of processing power in some hardware designs makes classical processes overly heavyweight, and IPC based on shared buffers makes no sense in a system designed to reduce memory bandwidth. Accordingly, we show how to implement these abstractions in a lightweight and efficient manner by exploiting compilation and interconnect protocols. We demonstrate them with a real hardware platform running applications with a range of memory access patterns, including bulk memory operations, in-memory databases and graph applications. Crucially, we show not only the benefits over CPU-only implementations, but also the critical importance of efficient, low-latency communication channels between CPU and NDP accelerators, a feature largely neglected in existing proposals.
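
In spirit, the programming model looks like ordinary processes joined by pipes. A host-side sketch under invented names (the real abstractions are compiled down to the accelerator and carried over the interconnect protocol, not run as a local queue) might express an NDP-side filter feeding a CPU-side consumer like this:

#include <cstdio>
#include <optional>
#include <queue>

// Pipe-like channel between a virtual processor placed near memory and one on the CPU.
// In a real NDP system this would map onto the interconnect protocol, not a local queue.
template <typename T>
class Channel {
public:
    void send(T v) { q_.push(v); }
    std::optional<T> recv() {
        if (q_.empty()) return std::nullopt;
        T v = q_.front(); q_.pop(); return v;
    }
private:
    std::queue<T> q_;
};

// "Virtual processor" near memory: scans rows and forwards only matches, so the full
// column never crosses the interconnect to the CPU.
void ndp_filter(const int* rows, int n, int threshold, Channel<int>& out) {
    for (int i = 0; i < n; ++i)
        if (rows[i] > threshold) out.send(rows[i]);
}

int main() {
    int rows[] = {3, 42, 7, 99, 15};
    Channel<int> pipe;
    ndp_filter(rows, 5, 10, pipe);                 // would execute on the NDP device
    long sum = 0;                                  // CPU-side consumer
    while (auto v = pipe.recv()) sum += *v;
    std::printf("sum of selected rows = %ld\n", sum);
}
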
cs.OS 2026-04-16

Filesystem lets AI agents self-correct file mistakes

Don't Let AI Agents YOLO Your Files: Shifting Information and Control to Filesystems for Agent Safety and Autonomy

Staging holds changes for review while snapshots let agents detect and undo their own side effects, cutting user prompts on routine tasks.

Abstract
AI coding agents operate directly on users' filesystems, where they regularly corrupt data, delete files, and leak secrets. Current approaches force a tradeoff between safety and autonomy: unrestricted access risks harm, while frequent permission prompts burden users and block agents. To understand this problem, we conduct the first systematic study of agent filesystem misuse, analyzing 290 public reports across 13 frameworks. Our analysis reveals that today's agents have limited information about their filesystem effects and insufficient control over them. We therefore argue for shifting this information and control to the filesystem itself. Based on this principle, we design YoloFS, an agent-native filesystem with three techniques. Staging isolates all mutations before commit, giving users corrective control. Snapshots extend this control to agents, letting them detect and correct their own mistakes. Progressive permission provides users with preventive control by gating access with minimal interaction. To evaluate YoloFS, we introduce a new methodology that captures user-agent-filesystem interactions. On 11 tasks with hidden side effects, YoloFS enables agent self-correction in 8 while keeping all effects staged and reviewable. On 112 routine tasks, YoloFS requires fewer user interactions while matching the baseline success rate.
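
The staging idea can be pictured with a small write-path sketch (a hypothetical API; YoloFS implements this at the filesystem layer, not in the agent): every mutation lands in a staging area, the agent or user can inspect the pending set, and only an explicit commit makes the changes visible in the real tree, while discard undoes them.

#include <cstdio>
#include <map>
#include <string>

// Toy staged filesystem: writes go to a staging map; commit applies them, discard undoes.
class StagedFs {
public:
    void write(const std::string& path, const std::string& data) { staged_[path] = data; }

    // Reads see staged content first, so the agent observes its own pending effects.
    std::string read(const std::string& path) const {
        auto s = staged_.find(path);
        if (s != staged_.end()) return s->second;
        auto c = committed_.find(path);
        return c != committed_.end() ? c->second : "";
    }
    void review() const {                            // what a user (or the agent) can inspect
        for (const auto& [path, data] : staged_)
            std::printf("staged: %s (%zu bytes)\n", path.c_str(), data.size());
    }
    void commit()  { for (auto& kv : staged_) committed_[kv.first] = kv.second; staged_.clear(); }
    void discard() { staged_.clear(); }              // agent self-correction: undo side effects

private:
    std::map<std::string, std::string> staged_, committed_;
};

int main() {
    StagedFs fs;
    fs.write("config.yaml", "debug: true\n");        // agent's mutation is held in staging
    fs.review();
    fs.discard();                                    // agent notices the mistake and undoes it
    std::printf("after discard: '%s'\n", fs.read("config.yaml").c_str());
}
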
cs.OS 2026-04-15

eBPF hooks decide page moves in tiered memory for up to 17% higher throughput

TierBPF: Page Migration Admission Control for Tiered Memory via eBPF

Lightweight profiling lets existing systems weigh page size and hardware topology before migrating, improving results on 17 workloads.

Abstract
Existing software-based memory tiering systems decide which pages to place on the slower or faster tier. However, they do not take into account two important factors that greatly influence application performance: the size of the migrated pages, and the underlying hardware device and tiering topology. We introduce TierBPF, a software mechanism that can be plugged into existing memory tiering systems to take these factors into account, by making simple binary page admission decisions. TierBPF is implemented as a set of eBPF hooks, which allow users to define their own custom policies. In order to make its decisions, TierBPF utilizes a lightweight tracking mechanism for page profiling which is not dependent on the application's working set size. TierBPF, integrated into three memory tiering systems and evaluated with 17 workloads, achieves geomean throughput gains of up to 17.7% with improvements of up to 75% for individual workloads.
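
The binary admission decision can be sketched as a tiny policy over the two neglected factors (illustrative thresholds and field names; in TierBPF the policy is user-supplied eBPF attached to the tiering system's migration path): admit a promotion only if the page is hot enough for its size and the bandwidth gap between source and destination tiers makes the move worthwhile.

#include <cstdint>
#include <cstdio>

// Candidate page migration as seen by the admission hook (names are illustrative).
struct MigrationCandidate {
    uint64_t page_bytes;       // 4 KiB base page or 2 MiB huge page
    uint32_t accesses;         // recent access count from lightweight profiling
    double   src_gbps;         // bandwidth of the tier the page lives on (e.g. a CXL device)
    double   dst_gbps;         // bandwidth of the destination tier (e.g. local DRAM)
};

// Admit the promotion only if the page is hot enough per byte (so a lukewarm huge page
// does not displace many hot base pages) and the destination tier is meaningfully faster.
bool admit(const MigrationCandidate& c) {
    double hotness_per_kib = (double)c.accesses / (c.page_bytes / 1024.0);
    double tier_speedup    = c.dst_gbps / c.src_gbps;
    return hotness_per_kib >= 0.05 && tier_speedup >= 1.5;
}

int main() {
    MigrationCandidate hot_4k  {4096,     8, 30.0, 200.0};
    MigrationCandidate warm_2m {2 << 20, 40, 30.0, 200.0};
    std::printf("hot 4 KiB page:  %s\n", admit(hot_4k)  ? "migrate" : "reject");
    std::printf("warm 2 MiB page: %s\n", admit(warm_2m) ? "migrate" : "reject");
}
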
cs.OS 2026-04-15

MARS cuts agentic latency by 5.94x via co-scheduling

MARS: Efficient, Adaptive Co-Scheduling for Heterogeneous Agentic Systems

Global coordination of GPU inference and CPU tools in multi-turn loops keeps throughput high and speeds real coding agents by 1.87x.

Abstract
Large language models (LLMs) are increasingly deployed as the execution core of autonomous agents rather than as standalone text generators. Agentic workloads induce a temporal shift from single-turn inference to multi-turn LLM-tool loops, and a spatial shift from chat-scale, GPU-only execution to repository-scale, GPU-CPU co-located execution. Consequently, coordinating heterogeneous resource demands of agentic execution has emerged as a critical system challenge. We design and implement MARS, an efficient and adaptive co-scheduling system that globally coordinates heterogeneous agentic workloads under coupled GPU-CPU resource pressure. By establishing holistic visibility across GPU inference and CPU tool execution via a unified information stream, an external control plane in MARS decouples admission from execution to prevent heterogeneous resource oversubscription. An internal agent-centric scheduler further minimizes the end-to-end critical path by prioritizing latency-sensitive continuations and adaptively retaining KV cache state only when warm resumption yields a latency benefit. Our evaluations show that MARS reduces end-to-end latency by up to 5.94x while maintaining nearly maximal system throughput. We further integrate MARS as the serving backend for the OpenHands coding agent framework, demonstrating its real-world effectiveness by accelerating end-to-end task completion time by up to 1.87x. Our source code will be publicly available soon.
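
The KV-retention rule can be written as a small cost comparison (a sketch with invented cost terms and numbers, not the MARS policy verbatim): keep a paused agent's KV cache on the GPU only when the latency saved by a warm resume outweighs the cost of the memory it pins during the expected tool-execution gap.

#include <cstdio>

// Per-agent state relevant to the retention decision (illustrative units).
struct PausedAgent {
    double kv_gb;                 // KV cache size if retained on the GPU
    double prefill_ms_if_cold;    // time to rebuild the KV cache on resume
    double expected_gap_ms;       // predicted duration of the CPU tool call
};

// Retain the KV cache only if the warm-resume saving outweighs the cost of pinning
// GPU memory, expressed here as an assumed price per GB*ms held.
bool retain_kv(const PausedAgent& a, double ms_penalty_per_gb_ms = 0.0005) {
    double holding_cost_ms = a.kv_gb * a.expected_gap_ms * ms_penalty_per_gb_ms;
    return a.prefill_ms_if_cold > holding_cost_ms;
}

int main() {
    PausedAgent quick_tool{2.0, 180.0, 400.0};      // short grep: keep the cache warm
    PausedAgent long_build{2.0, 180.0, 600000.0};   // ten-minute build: evict and re-prefill
    std::printf("quick tool: %s\n", retain_kv(quick_tool) ? "retain KV" : "evict KV");
    std::printf("long build: %s\n", retain_kv(long_build) ? "retain KV" : "evict KV");
}
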
cs.OS 2026-04-15

Hybrid tuning raises tiered memory performance by up to 30%

Hybrid Adaptive Tuning for Tiered Memory Systems

PTMT pairs an offline performance database with online reinforcement learning to choose better parameters for profiling and page migration.

Abstract
Memory tiering provides a cost-effective solution to increase memory capacity, utilization, and even bandwidth. Memory tiering relies on system software for memory profiling, detection of frequently accessed pages, and page migration. Such system software often comes with system parameters. The configurations of those parameters impact application performance. We comprehensively classify system parameters, and characterize the sensitivity of application performance to them using representative memory tiering solutions. Furthermore, we introduce a lightweight and user-friendly framework, PTMT, which automates tuning of parameters at runtime for various memory tiering solutions. We identify major challenges for online tuning of memory tiering. PTMT uses a hybrid "offline + online" tuning method: while the offline phase builds a performance database for online queries and reduces runtime overhead, the online phase uses reinforcement learning (customized to memory tiering) to tune them. PTMT improves performance by 30%, 26%, 21%, and 14% on four memory tiering solutions (TPP, UPM, Colloid, and AutoNUMA), compared to using the default configurations. PTMT outperforms the state-of-the-art by 32% on average.
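
The hybrid structure can be sketched in a few lines (invented signature metric and parameter set; the real system tunes several parameters per tiering solution, and the greedy hill-climb below is only a stand-in for the customized reinforcement-learning step): the offline database seeds a configuration from the most similar previously profiled workload, and the online phase then perturbs it, keeping a change only when measured throughput improves.

#include <cmath>
#include <cstdio>
#include <vector>

struct Config  { int scan_interval_ms; int promotion_threshold; };
struct DbEntry { double hot_fraction; Config best; };            // offline performance database

// Offline phase: seed from the profiled workload whose signature is closest.
Config seed_from_database(const std::vector<DbEntry>& db, double hot_fraction) {
    const DbEntry* best = &db.front();
    for (const DbEntry& e : db)
        if (std::fabs(e.hot_fraction - hot_fraction) <
            std::fabs(best->hot_fraction - hot_fraction)) best = &e;
    return best->best;
}

// Stand-in for measuring throughput of the running application under a configuration.
double measure_throughput(const Config& c) {
    return 100.0 - 0.02 * std::fabs(c.scan_interval_ms - 250)
                 - 0.50 * std::fabs(c.promotion_threshold - 3);
}

int main() {
    std::vector<DbEntry> db{{0.1, {1000, 8}}, {0.5, {400, 4}}, {0.9, {100, 2}}};
    Config cfg = seed_from_database(db, 0.6);        // online query against the offline DB

    // Online phase: try a neighbouring configuration, keep it only if throughput improves.
    double best = measure_throughput(cfg);
    for (int step = 0; step < 20; ++step) {
        Config trial = cfg;
        trial.scan_interval_ms += (step % 2 ? -50 : 50);
        trial.promotion_threshold += (step % 3 ? 0 : -1);
        if (trial.scan_interval_ms < 50 || trial.promotion_threshold < 1) continue;
        double t = measure_throughput(trial);
        if (t > best) { cfg = trial; best = t; }
    }
    std::printf("tuned: scan=%d ms, threshold=%d, throughput=%.1f\n",
                cfg.scan_interval_ms, cfg.promotion_threshold, best);
}
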
cs.OS 2026-04-14

Kernel reads one logit to classify AI agent actions

ProbeLogits: Kernel-Level LLM Inference Primitives for AI-Native Operating Systems

Single-pass method on base models reaches 97-99 percent block rates and matches dedicated guard models at lower latency, enforced below the sandbox boundary.

Abstract
An OS kernel that runs LLM inference internally can read logit distributions before any text is generated and act on them as a governance primitive. This paper presents ProbeLogits, a kernel-level operation that performs a single forward pass and reads specific token logits to classify agent actions as safe or dangerous, with zero learned parameters. I evaluate ProbeLogits across three base models (Qwen 2.5-7B, Llama 3 8B, Mistral 7B) on three external benchmarks: HarmBench, XSTest, and ToxicChat. On HarmBench non-copyright (n=300), all three models reach 97-99% block rate with the right verbalizer. On ToxicChat (n=1,000), ProbeLogits achieves F1 parity-or-better against Llama Guard 3 in the same hosted environment: the strongest configuration (Qwen 2.5-7B Safe/Dangerous, alpha=0.0) reaches F1=0.812 with bootstrap 95% CIs disjoint from LG3 (+13.7pp significant); Llama 3 S/D matches LG3 within CI (+0.4pp, parity); Mistral Y/N exceeds by +4.4pp. Latency is approximately 2.5x faster than LG3 in the same hosted environment because the primitive reads a single logit position instead of generating tokens; in the bare-metal native runtime ProbeLogits drops to 65 ms. A key design contribution is the calibration strength alpha, which serves as a deployment-time policy knob rather than a learned hyperparameter. Contextual calibration corrects verbalizer prior asymmetry, with bias magnitude varying by (model, verbalizer) pair. I implement ProbeLogits within Anima OS, a bare-metal x86_64 OS written in approximately 86,000 lines of Rust. Because agent actions must pass through 15 kernel-mediated host functions, ProbeLogits enforcement operates below the WASM sandbox boundary, making it significantly harder to circumvent than application-layer classifiers.
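
The classification itself reduces to arithmetic over two verbalizer logits. A sketch of that scoring with contextual calibration (toy logit values; the exact prior-estimation procedure is the paper's and is not shown here): the content-free prior captures the model's bias toward one verbalizer token, and alpha scales how strongly that bias is subtracted before comparing.

#include <cstdio>

// Decide whether an agent action is dangerous from one forward pass by comparing the
// logits of two verbalizer tokens ("Dangerous" vs "Safe"), with contextual calibration:
// subtract alpha times the same margin measured on a content-free prompt, which removes
// the model's prior preference for one of the two tokens. No learned parameters.
bool classify_dangerous(double logit_dangerous, double logit_safe,
                        double prior_dangerous, double prior_safe, double alpha) {
    double margin       = logit_dangerous - logit_safe;          // action-conditioned margin
    double prior_margin = prior_dangerous - prior_safe;          // content-free bias
    return (margin - alpha * prior_margin) > 0.0;
}

int main() {
    // Toy numbers: the model slightly prefers "Safe" even for a content-free prompt.
    double prior_d = 1.0, prior_s = 2.2;
    // Borderline action: the uncalibrated margin is negative, the calibrated one positive.
    double logit_d = 3.1, logit_s = 3.6;
    for (double alpha : {0.0, 1.0})
        std::printf("alpha=%.1f -> %s\n", alpha,
                    classify_dangerous(logit_d, logit_s, prior_d, prior_s, alpha)
                        ? "block (dangerous)" : "allow (safe)");
}
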
cs.OS 2026-04-14

Nanvix cuts serverless server needs by 20-100x

Nanvix: A Multikernel OS Design for High-Density Serverless Deployments

Split-kernel OS keeps per-invocation state separate from shared tenant state to raise density while retaining isolation.

Abstract
Serverless providers strive for high resource utilization by optimizing deployment density: how many applications can be deployed per host server. However, achieving high deployment density without compromising application performance or isolation remains an open challenge. High density can be achieved by sharing components across applications, yet applications from different tenants must be strongly isolated from each other due to the risk of side-channel attacks. Sharing components across applications from the same tenant, if done naively, can introduce contention on host resources, thus negatively affecting application performance. We describe Nanvix, a new multikernel OS that disaggregates ephemeral execution state, unique per application invocation, from long-lived persistent state, shared among invocations from the same tenant. Applications in Nanvix execute inside a lightweight user VM running a micro-kernel that implements threads and memory, and forwards all I/O requests to a system VM. The system VM runs a macro-kernel with a rich set of device drivers and is shared among all invocations from the same tenant. Nanvix's split design achieves strong hypervisor isolation across tenants without sacrificing application performance, and reduces same-tenant contention by multiplexing all I/O requests to the system VM. Thanks to a system-wide co-design, Nanvix achieves order-of-magnitude lower application start-up times with moderate I/O overheads. When replaying a production trace, Nanvix needs 20-100x fewer host servers compared to state-of-the-art systems, improving deployment density.
cs.OS 2026-04-13

Adaptive quantization cuts mobile LLM cold starts by 4x

EdgeFlow: Fast Cold Starts for LLMs on Mobile Devices

EdgeFlow lowers precision on less critical weights to speed loading from flash while preserving accuracy on phone NPUs.

Abstract
Deploying large language models (LLMs) on mobile devices is an emerging trend to enable data privacy and offline accessibility of LLM applications. Modern mobile neural processing units (NPUs) make such deployment increasingly feasible. However, existing mobile LLM inference frameworks suffer from high start-up latency due to their inevitable cold starts, i.e., launching LLM inferences when the model is not hosted in device memory. In this paper, we identify the key bottleneck of mobile LLM cold starts as the waste of flash bandwidth on unimportant model parameters. We design EdgeFlow, a mobile LLM inference framework that mitigates the cold start issue by adaptively adjusting the precisions of LLM parameters. Specifically, EdgeFlow leverages 1) an NPU-aware adaptive quantization algorithm that assigns different precisions to weights in a finer granularity according to their importance and NPU constraints, 2) an SIMD-friendly packing format that accelerates the transformation of various-precision weights into fixed-sized NPU-native data types, and 3) a synergistic granular pipeline that coordinates CPU and NPU computation in a fine-grained and dynamic manner. Experimental results show that EdgeFlow reduces cold-start latency by up to 4.07x compared with three state-of-the-art mobile LLM inference frameworks, i.e., llama.cpp, MNN, and llm.npu, under comparable model accuracy.
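
The quantization side of the idea can be sketched as a thresholding pass (an invented importance metric and bit widths; the actual algorithm also honours NPU data-type constraints and the packing format): weights judged important keep higher precision, the rest are stored in fewer bits, so a cold start reads fewer bytes from flash for the parameters that matter least.

#include <cstdio>
#include <vector>

struct WeightGroup { double importance; long params; };   // e.g. per-channel saliency

// Assign a bit width per weight group: important groups keep more bits, the rest are
// squeezed, trading a little accuracy for fewer bytes read from flash at cold start.
int assign_bits(const WeightGroup& g, double hi_cut, double lo_cut) {
    if (g.importance >= hi_cut) return 8;
    if (g.importance >= lo_cut) return 4;
    return 2;
}

int main() {
    std::vector<WeightGroup> groups{{0.92, 1 << 20}, {0.40, 4 << 20}, {0.05, 8 << 20}};
    long bytes = 0, bytes_fp16 = 0;
    for (const WeightGroup& g : groups) {
        int bits = assign_bits(g, 0.8, 0.2);
        bytes      += g.params * bits / 8;
        bytes_fp16 += g.params * 2;                  // baseline: load everything as fp16
        std::printf("group (importance %.2f): %d-bit\n", g.importance, bits);
    }
    std::printf("flash bytes: %ld vs %ld fp16 (%.1fx less to read)\n",
                bytes, bytes_fp16, (double)bytes_fp16 / bytes);
}
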
cs.OS 2026-04-10 Recognition

Valve saves 2,170 GPUs by colocating online and offline inference

Valve: Production Online-Offline Inference Colocation with Jointly-Bounded Preemption Latency and Rate

System jointly bounds preemption to sub-millisecond latency and once-per-request rate with minimal code changes.

Abstract
LLM inference powers latency-critical production services nowadays. The bursty nature of inference traffic results in over-provisioning, which in turn leads to resource underutilization. While online-offline colocation promises to utilize idle capacity, broad production deployment must overcome two major challenges: (i) large online interference due to slow or frequent preemptions, and (ii) extensive framework and driver modifications to colocate different models and support preemptions. We present Valve, a production-friendly colocation system that jointly bounds preemption latency and preemption rate. Specifically, Valve enables sub-millisecond compute preemption at most once per online request, and rate-limited sub-layer memory reclamation. These guarantees are provided by a GPU runtime that combines channel-controlled compute isolation, page-fault-free memory reclamation, and dynamic memory reservation. Critically, Valve is practical to deploy, requiring one line of driver modification and 20 lines of framework patch. Deployed on 8,054 GPUs in production, Valve improves cluster utilization by 34.6%, which translates to a saving of 2,170 GPUs. These efficiency gains are achieved with minimal online interference, incurring <5% TTFT increase and <2% TPOT increase across workloads.
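
The jointly-bounded guarantee can be illustrated with two small checks (a hypothetical host-side structure; the real enforcement lives in the GPU runtime): an online request may trigger compute preemption only if it has not already spent its single preemption, and a memory-reclamation event fires only while the recent event rate stays under a cap.

#include <cstdio>
#include <deque>

struct OnlineRequest { int id; bool preemption_used = false; };

// Sliding-window limiter for memory-reclamation events (illustrative policy).
class ReclaimRateLimiter {
public:
    ReclaimRateLimiter(int max_events, double window_ms) : max_(max_events), window_(window_ms) {}
    bool allow(double now_ms) {
        while (!events_.empty() && now_ms - events_.front() > window_) events_.pop_front();
        if ((int)events_.size() >= max_) return false;
        events_.push_back(now_ms);
        return true;
    }
private:
    int max_; double window_; std::deque<double> events_;
};

// Preempt offline compute for an online request at most once per request.
bool may_preempt_compute(OnlineRequest& r) {
    if (r.preemption_used) return false;
    r.preemption_used = true;
    return true;
}

int main() {
    OnlineRequest req{42};
    std::printf("first preemption:  %s\n", may_preempt_compute(req) ? "allowed" : "denied");
    std::printf("second preemption: %s\n", may_preempt_compute(req) ? "allowed" : "denied");

    ReclaimRateLimiter limiter(/*max_events=*/2, /*window_ms=*/100.0);
    for (double t : {0.0, 10.0, 20.0, 150.0})
        std::printf("reclaim at t=%.0f ms: %s\n", t, limiter.allow(t) ? "allowed" : "deferred");
}
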
cs.OS 2026-04-03 Recognition

Migratable actors on CXL SSDs dodge thermal cliffs

WIO: Upload-Enabled Computational Storage on CXL SSDs

WebAssembly storage actors move between host and device for up to 2x throughput and 3.75x lower write latency without app changes.

Abstract
The widening gap between processor speed and storage latency has made data movement a dominant bottleneck in modern systems. Two lines of storage-layer innovation attempted to close this gap: persistent memory shortened the latency hierarchy, while computational storage devices pushed processing toward the data. Neither has displaced conventional NVMe SSDs at scale, largely due to programming complexity, ecosystem fragmentation, and thermal/power cliffs under sustained load. We argue that storage-side compute should be reversible: computation should migrate dynamically between host and device based on runtime conditions. We present WIO, which realizes this principle on CXL SSDs by decomposing I/O-path logic into migratable storage actors compiled to WebAssembly. Actors share state through coherent CXL.mem regions; an agility-aware scheduler migrates them via a zero-copy drain-and-switch protocol when thermal or power constraints arise. Our evaluation on an FPGA-based CXL SSD prototype and two production CSDs shows that WIO turns hard thermal cliffs into elastic trade-offs, achieving up to 2x throughput improvement and 3.75x write latency reduction without application modification.
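
The thermal trigger for migration can be sketched as a simple control loop with hysteresis (invented thresholds; the real scheduler also weighs power caps and per-actor agility, and the drain step of finishing in-flight requests is elided): when device temperature crosses a high-water mark, the actor is drained on the device and switched to the host, and it moves back only once the device has cooled well below that mark.

#include <cstdio>

enum class Placement { Device, Host };

// Hysteresis keeps an actor from ping-ponging between host and device: migrate to the
// host above the high threshold, migrate back only once below the low threshold.
Placement decide(Placement current, double device_temp_c,
                 double high_c = 85.0, double low_c = 70.0) {
    if (current == Placement::Device && device_temp_c >= high_c) return Placement::Host;
    if (current == Placement::Host   && device_temp_c <= low_c)  return Placement::Device;
    return current;                       // inside the hysteresis band: stay put
}

int main() {
    Placement p = Placement::Device;
    for (double temp : {60.0, 80.0, 88.0, 82.0, 68.0}) {
        Placement next = decide(p, temp);
        std::printf("temp %.0f C: %s%s\n", temp,
                    next == Placement::Device ? "run on device" : "run on host",
                    next != p ? "  <- drain-and-switch" : "");
        p = next;
    }
}
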
