pith. machine review for the scientific record.

arxiv: 2509.20979 · v2 · submitted 2025-09-25 · 💻 cs.LG

Recognition: unknown

Toward Robust and Efficient ML-Based GPU Caching for Modern Inference

Authors on Pith: no claims yet
classification 💻 cs.LG
keywords: caching, performance, predictions, learning-augmented, guarantees, practical
0 comments
Original abstract

In modern GPU inference, cache efficiency remains a major bottleneck, and heuristic policies such as LRU can perform far worse than the offline optimum. Existing learning-based caching systems improve hit rates mainly through predictor design, but often follow learned predictions blindly, making performance unreliable when predictions are inaccurate. In contrast, emerging learning-augmented caching algorithms (Lykouris & Vassilvitskii, 2018; Mitzenmacher & Vassilvitskii, 2022) provide performance guarantees by carefully integrating predictions into caching policies, achieving both consistency (near-optimality under perfect predictions) and robustness (bounded worst-case performance under prediction errors). However, deployment remains challenging. A practical algorithm should satisfy strict time and space efficiency constraints, which some theoretical work overlooks, while also incurring low deployment overhead. We propose learning-augmented LRU, a deployment-oriented learning-augmented caching algorithm that guarantees 1-consistency and O(k)-robustness, incurs low time and space overhead, and maintains strong compatibility with existing systems. We further build a GPU cache, called LCR, on top of learning-augmented LRU to benefit from its theoretical guarantees and translate them into practical performance. In experiments, LCR reduces P99 time-to-first-token (TTFT) by up to 28.3% on LLM workloads and increases throughput by up to 24.2% on deep learning recommendation (DLRM) workloads. Even with poor predictions, performance degrades gracefully and remains close to LRU, demonstrating robustness with practical value.
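The abstract does not spell out the paper's algorithm, but the general learning-augmented caching idea it builds on can be sketched: evict by predicted next-access time when predictions are available (Belady-style), and fall back to plain LRU order otherwise, so bad or missing predictions degrade toward LRU rather than arbitrary behavior. The class name, `predictor` interface, and fallback rule below are illustrative assumptions, not the paper's LCR design.

```python
from collections import OrderedDict

class PredictiveLRUCache:
    """Illustrative learning-augmented cache (not the paper's algorithm):
    evict the key with the latest predicted next access; if any
    prediction is missing, fall back to LRU eviction."""

    def __init__(self, capacity, predictor=None):
        self.capacity = capacity
        self.predictor = predictor      # key -> predicted next-access time, or None
        self.cache = OrderedDict()      # iteration order = LRU order

    def access(self, key, value=None):
        """Return True on a hit, False on a miss (inserting the key)."""
        if key in self.cache:
            self.cache.move_to_end(key)  # refresh LRU position
            return True
        if len(self.cache) >= self.capacity:
            self._evict()
        self.cache[key] = value
        return False

    def _evict(self):
        victim = None
        if self.predictor is not None:
            preds = {k: self.predictor(k) for k in self.cache}
            if all(p is not None for p in preds.values()):
                # Belady-style: evict the key predicted to be reused latest
                victim = max(preds, key=preds.get)
        if victim is None:
            victim = next(iter(self.cache))  # LRU fallback
        del self.cache[victim]
```

With a perfect predictor this mimics the offline-optimal eviction choice; with no predictor it is exactly LRU, which is the graceful-degradation behavior the abstract's robustness guarantee formalizes.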

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving

    cs.DC 2026-05 unverdicted novelty 6.0

    HELM adaptively partitions HBM between EMB and KV caches via a three-layer PPO controller and EMB-KV-aware scheduling, reducing P99 latency by 24-38% while achieving 93.5-99.6% SLO satisfaction on production workloads.

  2. SCION: Size-aware Policy Orchestration for Nonstationary Object Caches (Long Paper Version)

    cs.DC 2026-03 conditional novelty 5.0

    SCION is a lightweight orchestration layer that picks among six deployable cache policies via an offline-trained linear selector on short-prefix size and reuse fingerprints, improving cacheable miss ratio over SIEVE o...