ReclaimNet: Reclaim-Aware Network Protocols for Voluntary GPU Sharing on Campus
University campuses host abundant but fragmented GPU resources whose voluntary sharing is blocked by a mismatch between revocable, autonomous ownership and migration mechanisms that assume stationary failure hazards, homogeneous interconnects, and unbounded transfer windows. We present ReclaimNet, a network-layer migration protocol suite that treats provider reclaim as a first-class contract rather than a failure case, combining three mechanisms: (i) reclaim-aware checkpoint scheduling that jointly adapts to time-varying departure hazards and contended bandwidth across co-resident jobs; (ii) volatility-aware destination selection integrating topology, survival probability, and notice-window feasibility; and (iii) deadline-aware migration traffic control with edge enforcement and a submillisecond TC BPF kill-switch. A two-month deployment on a 54-node heterogeneous campus testbed reduces work loss by 66% over Slurm preempt-and-requeue and 38% over pipeline-redundancy checkpointing, with 38% shorter downtime and under 3% degradation of background research traffic. The prototype is open-sourced at the anonymous repository https://anonymous.4open.science/r/ICNP2026-ReclaimNet/.
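As an illustration of mechanism (ii), the sketch below shows one way a destination could be chosen by combining notice-window feasibility, survival probability, and topology. It is not the ReclaimNet implementation: the `Candidate` fields, the `pick_destination` function, the linear score, and the `hop_penalty` weight are all hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Candidate:
    """Hypothetical description of a possible migration target."""
    name: str
    bandwidth_gbps: float   # usable path bandwidth to this node (assumed measured)
    hops: int               # topology distance from the source node
    survival_prob: float    # estimated probability the node is not reclaimed soon


def transfer_seconds(checkpoint_gb: float, bandwidth_gbps: float) -> float:
    """Time to move a checkpoint of `checkpoint_gb` gigabytes at `bandwidth_gbps`."""
    return checkpoint_gb * 8.0 / bandwidth_gbps


def pick_destination(
    candidates: Iterable[Candidate],
    checkpoint_gb: float,
    notice_window_s: float,
    hop_penalty: float = 0.05,  # assumed weight, not taken from the paper
) -> Optional[Candidate]:
    """Return the best feasible candidate, or None if no transfer fits the window."""
    best: Optional[Candidate] = None
    best_score = float("-inf")
    for c in candidates:
        if c.bandwidth_gbps <= 0:
            continue  # no usable path
        # Notice-window feasibility: the transfer must finish before reclaim.
        if transfer_seconds(checkpoint_gb, c.bandwidth_gbps) > notice_window_s:
            continue
        # Illustrative score: prefer nodes likely to survive, penalize distance.
        score = c.survival_prob - hop_penalty * c.hops
        if score > best_score:
            best, best_score = c, score
    return best


if __name__ == "__main__":
    nodes = [
        Candidate("A", bandwidth_gbps=1.0, hops=1, survival_prob=0.95),
        Candidate("B", bandwidth_gbps=10.0, hops=2, survival_prob=0.90),
        Candidate("C", bandwidth_gbps=25.0, hops=4, survival_prob=0.92),
    ]
    choice = pick_destination(nodes, checkpoint_gb=8.0, notice_window_s=30.0)
    print(choice.name if choice else "no feasible destination")
```

In this toy setup, node A has the highest survival estimate but cannot receive an 8 GB checkpoint within a 30-second notice window, so it is filtered out before scoring. Treating feasibility as a hard constraint rather than a score term is one design choice among several; the abstract does not say how the three factors are actually combined.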