
arxiv: 2605.28872 · v1 · pith:OCLGVB4Rnew · submitted 2026-05-23 · 💻 cs.NI

ReclaimNet: Reclaim-Aware Network Protocols for Voluntary GPU Sharing on Campus

classification 💻 cs.NI
keywords: migration, anonymous, campus, failure, hazards, mechanisms, reclaim-aware, reclaimnet

University campuses host abundant but fragmented GPU resources whose voluntary sharing is blocked by a mismatch between revocable, autonomous ownership and migration mechanisms that assume stationary failure hazards, homogeneous interconnects, and unbounded transfer windows. We present ReclaimNet, a network-layer migration protocol suite that treats provider reclaim as a first-class contract rather than a failure case, combining three mechanisms: (i) reclaim-aware checkpoint scheduling that jointly adapts to time-varying departure hazards and contended bandwidth across co-resident jobs; (ii) volatility-aware destination selection integrating topology, survival probability, and notice-window feasibility; and (iii) deadline-aware migration traffic control with edge enforcement and a submillisecond TC BPF kill-switch. A two-month deployment on a 54-node heterogeneous campus testbed reduces work loss by 66% over Slurm preempt-and-requeue and 38% over pipeline-redundancy checkpointing, with 38% shorter downtime and under 3% degradation of background research traffic. The prototype is open-sourced at the anonymous repository https://anonymous.4open.science/r/ICNP2026-ReclaimNet/.
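To make mechanism (ii) concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a constant per-node reclaim hazard (the abstract describes time-varying hazards, so this is a deliberate simplification), a hop count as a stand-in for topology, and a linear hop penalty. All names (`Candidate`, `select_destination`, `hop_penalty`) and the scoring rule are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """A possible migration target (all fields are illustrative assumptions)."""
    name: str
    bandwidth_gbps: float   # assumed usable bandwidth from the source node
    hazard_per_hour: float  # assumed constant reclaim hazard rate
    hops: int               # assumed topology distance from the source


def transfer_seconds(state_gb: float, bandwidth_gbps: float) -> float:
    """Time to move state_gb gigabytes at bandwidth_gbps gigabits per second."""
    return state_gb * 8.0 / bandwidth_gbps


def survival_probability(hazard_per_hour: float, horizon_hours: float) -> float:
    """Probability the node is not reclaimed within the horizon (constant hazard)."""
    return math.exp(-hazard_per_hour * horizon_hours)


def select_destination(
    candidates: List[Candidate],
    state_gb: float,
    notice_seconds: float,
    horizon_hours: float,
    hop_penalty: float = 0.05,
) -> Optional[Candidate]:
    """Pick the feasible candidate with the best survival-minus-distance score."""
    best: Optional[Candidate] = None
    best_score = -math.inf
    for c in candidates:
        # Notice-window feasibility: the transfer must finish before reclaim.
        if transfer_seconds(state_gb, c.bandwidth_gbps) > notice_seconds:
            continue
        score = survival_probability(c.hazard_per_hour, horizon_hours)
        score -= hop_penalty * c.hops
        if score > best_score:
            best, best_score = c, score
    return best


if __name__ == "__main__":
    nodes = [
        Candidate("a", bandwidth_gbps=1.0, hazard_per_hour=0.01, hops=1),
        Candidate("b", bandwidth_gbps=10.0, hazard_per_hour=0.5, hops=1),
        Candidate("c", bandwidth_gbps=10.0, hazard_per_hour=0.1, hops=3),
    ]
    choice = select_destination(nodes, state_gb=20.0, notice_seconds=60.0,
                                horizon_hours=2.0)
    print(choice.name if choice else "no feasible destination")
```

In this sketch, node `a` has the lowest hazard but is excluded because its transfer cannot finish inside the notice window, which illustrates why feasibility is checked before survival probability is compared.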
