
arxiv: 2509.20979 · v2 · submitted 2025-09-25 · 💻 cs.LG

Toward Robust and Efficient ML-Based GPU Caching for Modern Inference

Pith reviewed 2026-05-18 14:24 UTC · model grok-4.3

classification 💻 cs.LG
keywords learning-augmented caching · GPU cache · LRU · LLM inference · DLRM · consistency · robustness · ML-based caching

The pith

Learning-augmented LRU guarantees 1-consistency and O(k)-robustness for GPU inference caches with low overhead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a caching algorithm that merges machine learning predictions with the classic LRU policy for use in GPU-based inference systems. The goal is to get the benefits of learned predictions for higher hit rates while ensuring the system does not fall too far behind standard LRU if the predictions turn out wrong. By proving 1-consistency and O(k)-robustness and keeping computational costs low, the method aims to make learning-based caching reliable enough for real deployment in large language model and recommendation model serving.

Core claim

We propose learning-augmented LRU, a deployment-oriented learning-augmented caching algorithm that guarantees 1-consistency and O(k)-robustness, incurs low time and space overhead, and maintains strong compatibility. We further build a GPU cache, called LCR, on top of learning-augmented LRU to benefit from its theoretical guarantees and translate them into practical performance. In experiments, LCR reduces P99 time-to-first-token by up to 28.3% on LLM workloads and increases throughput by up to 24.2% on DLRM workloads, with graceful degradation under poor predictions.

What carries the argument

Learning-augmented LRU, which integrates predictions into LRU eviction to achieve consistency and robustness guarantees while remaining efficient.
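
To make the idea of "integrating predictions into LRU eviction" concrete, here is a minimal sketch in Python. It is not the paper's learning-augmented LRU, whose details are not given on this page. The class name `PredictionGuidedLRU`, the `window` parameter, and the `predict_next_use` callback are all assumptions made for illustration: the cache keeps ordinary LRU order and lets a predictor choose the victim only among the few least recently used entries. No consistency or robustness guarantee is claimed for this toy.

```python
from collections import OrderedDict
from itertools import islice
from typing import Callable, Hashable


class PredictionGuidedLRU:
    """Toy cache: LRU ordering plus a predictor consulted only at eviction time.

    Hypothetical illustration, NOT the algorithm proposed in the paper.
    """

    def __init__(self, capacity: int,
                 predict_next_use: Callable[[Hashable], float],
                 window: int = 4) -> None:
        if capacity < 1 or window < 1:
            raise ValueError("capacity and window must be positive")
        self.capacity = capacity
        # Assumed predictor interface: a larger value means "reused later".
        self.predict_next_use = predict_next_use
        # The predictor may only choose among the `window` oldest entries.
        self.window = window
        self._items: OrderedDict = OrderedDict()  # oldest entry first

    def access(self, key: Hashable) -> bool:
        """Return True on a hit; on a miss, admit the key and return False."""
        if key in self._items:
            self._items.move_to_end(key)  # refresh recency, as plain LRU does
            return True
        if len(self._items) >= self.capacity:
            self._evict()
        self._items[key] = None
        return False

    def _evict(self) -> None:
        # Candidates are the `window` least recently used keys, oldest first.
        candidates = list(islice(self._items, self.window))
        # max() keeps the first maximal element, so ties fall back to LRU order.
        victim = max(candidates, key=self.predict_next_use)
        del self._items[victim]


# Usage: the predictor claims "a" is needed soon and "b" much later.
predicted = {"a": 1.0, "b": 100.0, "c": 5.0}
guided = PredictionGuidedLRU(2, lambda k: predicted[k], window=2)
guided_hits = [guided.access(k) for k in "abca"]

# With window=1 the predictor has no choice and the cache is plain LRU.
plain = PredictionGuidedLRU(2, lambda k: predicted[k], window=1)
plain_hits = [plain.access(k) for k in "abca"]
```

Restricting the predictor to a small window is one simple way to limit how far a bad prediction can pull the policy away from LRU; whether the paper uses anything similar is not stated here.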

If this is right

  • Cache performance approaches the offline optimum when predictions are accurate.
  • Performance remains bounded relative to standard LRU under arbitrary prediction errors.
  • Low time and space overhead allows deployment in existing GPU inference pipelines.
  • The same guarantees apply across both LLM and DLRM inference workloads.
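
The two guarantees behind these bullets can be written down with the standard competitive-analysis definitions used in the learning-augmented algorithms literature that the paper cites. The notation below (ALG, OPT, the request sequence \sigma, and the additive constant c) is assumed for illustration rather than taken from the paper, and k is read as the cache size, which is the usual convention.

```latex
% \sigma: request sequence; k: cache size (assumed meaning of k)
% ALG(\sigma), OPT(\sigma): miss counts of the online algorithm and of the offline optimum
\begin{align*}
\text{$\alpha$-consistency:} \quad
  & \mathrm{ALG}(\sigma) \le \alpha \cdot \mathrm{OPT}(\sigma) + c
  && \text{for every } \sigma \text{ when the predictions are perfect,} \\
\text{$\beta$-robustness:} \quad
  & \mathrm{ALG}(\sigma) \le \beta \cdot \mathrm{OPT}(\sigma) + c
  && \text{for every } \sigma \text{ and arbitrary predictions.}
\end{align*}
% The claimed guarantees correspond to \alpha = 1 and \beta = O(k).
```

For comparison, plain LRU is k-competitive in this sense, so an O(k)-robust policy has a worst case of the same order as LRU's.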

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Production inference systems could adopt this as a safer way to incorporate ML predictors into caching.
  • The robustness bound may allow the policy to replace LRU without additional monitoring of predictor quality.
  • The integration technique could be tested on other base policies beyond LRU.

Load-bearing premise

The theoretical consistency and robustness properties can be realized in a practical implementation that meets strict time and space efficiency constraints on real GPU hardware.

What would settle it

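Measure LCR performance under deliberately inaccurate predictions on GPU hardware and check whether its miss count stays within the stated O(k) factor of the offline optimum, which is the yardstick for robustness, and whether it remains close to plain LRU as the abstract claims.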
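
A trace-driven simulation gives a cheap first version of this check before any GPU is involved. The sketch below, in Python, counts misses for plain LRU, for the offline optimum (Belady's rule, which needs exact next-use times), and for a policy that blindly follows supplied predictions. It does not implement LCR. The function names, the cyclic trace, and the "inverted" predictions used as a stand-in for bad advice are all assumptions for illustration.

```python
from collections import OrderedDict


def lru_misses(trace, k):
    """Count the misses of plain LRU with k slots."""
    cache = OrderedDict()
    misses = 0
    for key in trace:
        if key in cache:
            cache.move_to_end(key)
        else:
            misses += 1
            if len(cache) >= k:
                cache.popitem(last=False)  # drop the least recently used key
            cache[key] = None
    return misses


def next_use_times(trace):
    """For each position i, the next position where trace[i] recurs (inf if never)."""
    nxt = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        nxt[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i
    return nxt


def farthest_predicted_misses(trace, k, predictions):
    """Evict the cached key whose predicted next use is farthest away.

    With exact next-use times as predictions this is Belady's offline optimum;
    with wrong predictions it is a policy that trusts its predictor blindly.
    """
    cache = {}  # key -> latest predicted next use
    misses = 0
    for i, key in enumerate(trace):
        if key not in cache:
            misses += 1
            if len(cache) >= k:
                victim = max(cache, key=cache.get)
                del cache[victim]
        cache[key] = predictions[i]  # refresh the prediction on every access
    return misses


# Classic hard case for LRU: cycle through k + 1 distinct keys.
k = 3
trace = [0, 1, 2, 3] * 5
exact = next_use_times(trace)
inverted = [-t for t in exact]  # hypothetical bad advice: prefers evicting keys needed soonest

lru = lru_misses(trace, k)                            # 20: every request misses
opt = farthest_predicted_misses(trace, k, exact)      # 9
blind = farthest_predicted_misses(trace, k, inverted)
print(f"LRU={lru} OPT={opt} blind-with-bad-predictions={blind}")
```

Dropping a candidate policy into the same harness and comparing its miss count against `opt` and `lru` across traces and noise levels is the software analogue of the hardware experiment described above.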

Figures

Figures reproduced from arXiv: 2509.20979 by Gang Xiong, Hailiang Zhao, Hao Li, Jiahong Yu, Jiaji Zhang, Jianping Zou, Kingsum Chow, Peng Chen, Shenyao Chen, Shuibing He, Shuiguang Deng, Xueyan Tang, Yirong Zhang, Yixuan Wang.

Figure 1: Goals of learning-based caching algorithms.
Figure 2: Key performance metrics differ between DLRMs and LLMs. (1) DLRMs process requests in tens of milliseconds, enabling high concurrency. System cost is dominated by throughput, making average latency the key optimization target. (2) LLMs involve longer sequences and heavier computation, resulting in lower concurrency. User experience depends heavily on tail latency, particularly P95/P99 TTFT, which is stron…
Figure 3: Illustration of two GPU caching use cases. (1) DLRMs: Inputs contain dense features and sparse features (e.g., user ID, product category). Dense features are processed by a bottom MLP, while sparse features are mapped into embedding vectors via large pre-trained tables, often stored hierarchically in GPU HBM and DRAM. (2) LLMs: Each request prompt is a token sequence whose KV vectors are computed during th…
Figure 4: Cache hit rate gap between LRU and the offline optimum (OPT) under varying cache ratios. Conventional cache replacement policies struggle under dynamic inference workloads, motivating the use of ML to adapt to evolving access patterns. 2.2 Existing learning-based caching systems To address the limitations of heuristic policies, recent work has explored learning-based caching [12, 22, 29, 32, 34, 35, 38, 4…
Figure 6: Comparison of performance guarantees among rep…
Figure 5: Cost of caching algorithms with different types of…
Figure 7: LCR’s architecture and workflows.
Figure 8: LARU’s algorithmic process on a cache miss.
Figure 9: Average SLS latency improvement across different cache ratios and batch sizes. FPB and LARU employ the same LightGBM predictor. Numbers in boxes denote LARU’s improvement over LRU. Results are reported for (a) AD-CTR-User and (b) QB-video datasets, evaluated at cache ratios of 1%, 2%, and 3%. The cache ratio is cache size over total distinct items. Predictor settings. Due to the high concurrency of DLRM wo…
Figure 10: Robustness Verification. Average SLS latency under different batch sizes and cache ratios (1%-3%) on (a) AD-CTR-User and (b) QB-video datasets. Both FPB and LARU use the same low-accuracy LightGBM predictor to assess robustness. improves SLS throughput, particularly at the large batch sizes common in production environments. Robustness Verification. We empirically verify the robustness of LARU using an i…
Figure 11: P99 TTFT improvement of Qwen2.5-32B on the…
Figure 12: P99 TTFT improvement of DeepSeek-R1-671B…
Figure 13: Robustness Verification. KV cache hit rate under injected noise on Aibrix-Synthetic (top) and Online-QA (bottom). The probability of replacing predictions with worst-case noise is varied from 0.1 to 1.0. overly conservatively, without fully benefiting from accurate predictions. In contrast, LARU degrades gracefully, consistently maintaining higher hit rates than both FPB and LRU across all error levels. …
Original abstract

In modern GPU inference, cache efficiency remains a major bottleneck, and heuristic policies such as LRU can perform far worse than the offline optimum. Existing learning-based caching systems improve hit rates mainly through predictor design, but often follow learned predictions blindly, making performance unreliable when predictions are inaccurate. In contrast, emerging learning-augmented caching algorithms [pmlr-v80-lykouris18a, mitzenmacher2022algorithms] provide performance guarantees by carefully integrating predictions into caching policies, achieving both consistency (near-optimality under perfect predictions) and robustness (bounded worst-case performance under prediction errors). However, deployment remains challenging. A practical algorithm should satisfy strict time and space efficiency constraints, which some theoretical work overlooks, while also incurring low deployment overhead. We propose learning-augmented LRU, a deployment-oriented learning-augmented caching algorithm that guarantees 1-consistency and O(k)-robustness, incurs low time and space overhead, and maintains strong compatibility. We further build a GPU cache, called LCR, on top of learning-augmented LRU to benefit from its theoretical guarantees and translate them into practical performance. In experiments, LCR reduces P99 time-to-first-token (TTFT) by up to 28.3% on LLM workloads and increases throughput by up to 24.2% on deep learning recommendation (DLRM) workloads. Even with poor predictions, performance degrades gracefully and remains close to LRU, demonstrating robustness with practical value.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes learning-augmented LRU, a caching algorithm that integrates predictions to achieve 1-consistency and O(k)-robustness while incurring low time and space overhead and maintaining compatibility with existing systems. It then builds the LCR GPU cache on this policy and reports empirical gains of up to 28.3% reduction in P99 TTFT on LLM workloads and 24.2% throughput improvement on DLRM workloads, with graceful degradation under inaccurate predictions.

Significance. If the claimed consistency and robustness properties are preserved under the low-overhead implementation and the experiments adequately control for prediction quality and workload selection, the work would meaningfully advance the practical deployment of learning-augmented algorithms in GPU inference systems, closing a noted gap between theoretical guarantees and real hardware constraints.

major comments (2)
  1. [Experimental Evaluation] The central claim requires that the O(k)-robustness bound survives the practical implementation; however, the manuscript provides no micro-benchmark or per-request latency breakdown isolating the auxiliary predictor state and eviction-score recomputation costs relative to plain LRU (see the experimental setup and LCR implementation sections). Without this, it is impossible to confirm that the added overhead remains within the strict per-operation budget of modern GPU inference kernels.
  2. [Learning-Augmented LRU] The abstract asserts 1-consistency and O(k)-robustness for learning-augmented LRU, yet the manuscript contains no derivation steps, pseudocode, or reduction showing how the low-overhead modifications preserve these properties from the cited prior literature (pmlr-v80-lykouris18a, mitzenmacher2022algorithms). This derivation is load-bearing for the deployment-oriented positioning.
minor comments (2)
  1. [Theoretical Analysis] Notation for the robustness parameter k is introduced without an explicit definition or relation to cache size in the theoretical section.
  2. [Experiments] Figure captions for the TTFT and throughput plots should include error bars or the number of runs to clarify statistical significance of the reported 28.3% and 24.2% gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address the major comments point by point below, and we are prepared to incorporate revisions to strengthen the manuscript.

  1. Referee: [Experimental Evaluation] The central claim requires that the O(k)-robustness bound survives the practical implementation; however, the manuscript provides no micro-benchmark or per-request latency breakdown isolating the auxiliary predictor state and eviction-score recomputation costs relative to plain LRU (see the experimental setup and LCR implementation sections). Without this, it is impossible to confirm that the added overhead remains within the strict per-operation budget of modern GPU inference kernels.

    Authors: We agree that providing a micro-benchmark or per-request latency breakdown to isolate the costs of the auxiliary predictor state and eviction-score recomputation would help confirm that the overhead is low enough for modern GPU inference kernels. Our current experiments focus on end-to-end performance metrics, which show improvements without apparent overhead penalties. To address this, we will add a new subsection in the experimental evaluation with micro-benchmarks comparing the per-operation costs to plain LRU. revision: yes

  2. Referee: [Learning-Augmented LRU] The abstract asserts 1-consistency and O(k)-robustness for learning-augmented LRU, yet the manuscript contains no derivation steps, pseudocode, or reduction showing how the low-overhead modifications preserve these properties from the cited prior literature (pmlr-v80-lykouris18a, mitzenmacher2022algorithms). This derivation is load-bearing for the deployment-oriented positioning.

    Authors: We appreciate this observation. The low-overhead modifications to the learning-augmented LRU are intended to preserve the 1-consistency and O(k)-robustness guarantees from the referenced works by maintaining the core integration of predictions while optimizing for efficiency. However, we acknowledge that the manuscript would benefit from explicit derivation steps and pseudocode. We will add a new section or appendix that provides the derivation, including how the modifications reduce to the original bounds, along with pseudocode for the algorithm. revision: yes
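
The per-operation measurement requested in the first comment can be prototyped with a small timing harness. The Python sketch below times a plain LRU access path against the same path with one extra predictor call and score update per request. It runs on the host CPU, so the absolute numbers say nothing about GPU kernels; it only shows the shape of such a micro-benchmark. The helper names, the synthetic trace, and the trivial `lambda` predictor are assumptions for illustration, not the paper's setup.

```python
import timeit
from collections import OrderedDict


def make_lru_access(capacity):
    """Return an access(key) -> bool closure backed by a plain LRU cache."""
    cache = OrderedDict()

    def access(key):
        if key in cache:
            cache.move_to_end(key)
            return True
        if len(cache) >= capacity:
            cache.popitem(last=False)  # evict the least recently used key
        cache[key] = None
        return False

    return access


def make_predicting_access(capacity, predict):
    """Same LRU path plus one predictor call and score update per request."""
    base = make_lru_access(capacity)
    scores = {}  # stand-in for auxiliary predictor state (hypothetical)

    def access(key):
        scores[key] = predict(key)
        return base(key)

    return access


def per_op_nanoseconds(access, trace, repeats=3):
    """Best-of-`repeats` wall time per request, in nanoseconds."""
    timings = timeit.repeat(lambda: [access(k) for k in trace],
                            number=1, repeat=repeats)
    return min(timings) / len(trace) * 1e9


# Synthetic trace over 1000 distinct keys (an assumption, not a real workload).
trace = [(i * 7919) % 1000 for i in range(20_000)]
plain_ns = per_op_nanoseconds(make_lru_access(100), trace)
augmented_ns = per_op_nanoseconds(make_predicting_access(100, lambda k: k % 17), trace)
print(f"plain LRU: {plain_ns:.0f} ns/op, with predictor call: {augmented_ns:.0f} ns/op")
```

Reporting the two per-operation figures side by side, ideally with the predictor cost and the eviction-score update timed separately, is the kind of breakdown the referee asks for.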

Circularity Check

0 steps flagged

No significant circularity; guarantees rest on external literature


The paper's core proposal of learning-augmented LRU with 1-consistency and O(k)-robustness is explicitly positioned as building on cited prior work (Lykouris and Vassilvitskii 2018; Mitzenmacher and Vassilvitskii 2022) rather than deriving those properties from quantities fitted or defined inside this manuscript. No equations, self-citations, or ansatzes are shown reducing the claimed guarantees or experimental gains (28.3% TTFT, 24.2% throughput) to tautological inputs. The contribution centers on practical GPU deployment and compatibility, which is independently assessed via experiments; the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The proposal rests on standard definitions of consistency and robustness from the cited learning-augmented algorithms literature and on the assumption that ML predictors can be queried with negligible overhead; no free parameters or new invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Standard definitions of 1-consistency and O(k)-robustness from the learning-augmented caching literature.
    The guarantees are defined with respect to these prior notions as referenced in the abstract.



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving

    cs.DC 2026-05 unverdicted novelty 6.0

    HELM adaptively partitions HBM between EMB and KV caches via a three-layer PPO controller and EMB-KV-aware scheduling, reducing P99 latency by 24-38% while achieving 93.5-99.6% SLO satisfaction on production workloads.

  2. SCION: Size-aware Policy Orchestration for Nonstationary Object Caches (Long Paper Version)

    cs.DC 2026-03 conditional novelty 5.0

    SCION is a lightweight orchestration layer that picks among six deployable cache policies via an offline-trained linear selector on short-prefix size and reuse fingerprints, improving cacheable miss ratio over SIEVE o...

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 2 Pith papers · 2 internal anchors

  1. [1] Sharegpt raw. URL https://huggingface.co/datasets/philschmid/sharegpt-raw/tree/main/sharegpt_90k_raw_dataset

  2. [2] The 2nd cache replacement championship, 2017. URL https://crc2.ece.tamu.edu/

  3. [3] Saman Ashkiani, Martin Farach-Colton, and John D Owens. A dynamic hash table for the gpu. In 2018 IEEE international parallel and distributed processing symposium (IPDPS), pages 419–429. IEEE, 2018

  4. [4] Nathan Beckmann, Haoxian Chen, and Asaf Cidon. LHD: Improving cache hit rate by maximizing hit density. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), pages 389–403, 2018

  5. [5] Allan Borodin and Ran El-Yaniv. Online computation and competitive analysis, 1998

  6. [6] Sitian Chen, Haobin Tan, Amelie Chi Zhou, Yusen Li, and Pavan Balaji. Updlrm: Accelerating personalized recommendation using real-world pim architecture. In Proceedings of the 61st ACM/IEEE Design Automation Conference, pages 1–6, 2024

  7. [7] Feng Cheng, Cong Guo, Chiyue Wei, Junyao Zhang, Changchun Zhou, Edward Hanson, Jiaqi Zhang, Xiaoxiao Liu, Hai Li, and Yiran Chen. Ecco: Improving memory bandwidth and capacity for llms via entropy-aware cache compression. In Proceedings of the 52nd Annual International Symposium on Computer Architecture, pages 793–807, 2025

  8. [8] Assaf Eisenman, Maxim Naumov, Darryl Gardner, Misha Smelyanskiy, Sergey Pupyrev, Kim Hazelwood, Asaf Cidon, and Sachin Katti. Bandana: Using non-volatile memory for storing deep learning models. Proceedings of machine learning and systems, 1:40–52, 2019

  9. [9] Amos Fiat, Richard M Karp, Michael Luby, Lyle A McGeoch, Daniel D Sleator, and Neal E Young. Competitive paging algorithms. Journal of Algorithms, 12(4):685–699, 1991

  10. [10] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  11. [11] Xinyue Hu, Eman Ramadan, Wei Ye, Feng Tian, and Zhi-Li Zhang. Raven: belady-guided, predictive (deep) learning for in-memory and content caching. In Proceedings of the 18th International Conference on emerging Networking EXperiments and Technologies, pages 72–90, 2022

  12. [12] Akanksha Jain and Calvin Lin. Back to the future: Leveraging belady’s algorithm for improved cache replacement. ACM SIGARCH Computer Architecture News, 44(3):78–89, 2016

  13. [13] Wenqi Jiang, Zhenhao He, Shuai Zhang, Thomas B Preußer, Kai Zeng, Liang Feng, Jiansong Zhang, Tongxuan Liu, Yong Li, Jingren Zhou, et al. Microrec: Efficient recommendation inference by hardware and data structure solutions. Proceedings of Machine Learning and Systems, 3:845–859, 2021

  14. [14] Wenqi Jiang, Zhenhao He, Shuai Zhang, Kai Zeng, Liang Feng, Jiansong Zhang, Tongxuan Liu, Yong Li, Jingren Zhou, Ce Zhang, et al. Fleetrec: Large-scale recommendation inference on hybrid gpu-fpga clusters. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 3097–3105, 2021

  15. [15] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems, 30, 2017

  16. [16] Liu Ke, Udit Gupta, Benjamin Youngjae Cho, David Brooks, Vikas Chandra, Utku Diril, Amin Firoozshahian, Kim Hazelwood, Bill Jia, Hsien-Hsin S Lee, et al. Recnmp: Accelerating personalized recommendation with near-memory processing. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pages 790–803. IEEE, 2020

  17. [17] Daniar H Kurniawan, Ruipu Wang, Kahfi S Zulkifli, Fandi A Wiranata, John Bent, Ymir Vigfusson, and Haryadi S Gunawi. Evstore: Storage and caching capabilities for scaling embedding tables in deep recommendation systems. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vol-...

  18. [18] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023

  19. [19] Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. Tensordimm: A practical near-memory processing architecture for embeddings and tensor operations in deep learning. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 740–753, 2019

  20. [20] Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. InfiniGen: Efficient generative inference of large language models with dynamic KV cache management. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 155–172, 2024

  21. [21] Yejin Lee, Seong Hoon Seo, Hyunji Choi, Hyoung Uk Sul, Soosung Kim, Jae W Lee, and Tae Jun Ham. Merci: efficient embedding reduction on commodity hardware via sub-query memoization. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 302–313, 2021

  22. [22] Evan Liu, Milad Hashemi, Kevin Swersky, Parthasarathy Ranganathan, and Junwhan Ahn. An imitation learning approach for cache replacement. In International Conference on Machine Learning, pages 6237–6247. PMLR, 2020

  23. [23] Thodoris Lykouris and Sergei Vassilvitskii. Competitive caching with machine learned advice. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3296–3305. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/lykouris18a.html

  24. [24] Michael Mitzenmacher and Sergei Vassilvitskii. Algorithms with predictions. Commun. ACM, 65(7):33–35, June 2022. ISSN 0001-0782. doi: 10.1145/3528087. URL https://doi.org/10.1145/3528087

  25. [25] Michael Mitzenmacher and Sergei Vassilvitskii. Algorithms with predictions. Communications of the ACM, 65(7):33–35, 2022

  26. [26] Yudi Qiu, Lingfei Lu, Shiyan Yi, Minge Jing, Xiaoyang Zeng, Yang Kong, and Yibo Fan. Flips: A flexible partitioning strategy near memory processing architecture for recommendation system. IEEE Transactions on Parallel and Distributed Systems, 2025

  27. [27] Pol G. Recasens, Ferran Agullo, Yue Zhu, Chen Wang, Eun Kyung Lee, Olivier Tardieu, Jordi Torres, and Josep Ll. Berral. Mind the memory gap: Unveiling gpu bottlenecks in large-batch llm inference, 2025. URL https://arxiv.org/abs/2503.08311

  28. [28] Jie Ren, Bin Ma, Shuangyan Yang, Benjamin Francis, Ehsan K Ardestani, Min Si, and Dong Li. Machine learning-guided memory optimization for dlrm inference on tiered memory. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 1631–1647. IEEE, 2025

  29. [29] Liana V Rodriguez, Farzana Yusuf, Steven Lyons, Eysler Paz, Raju Rangaswami, Jason Liu, Ming Zhao, and Giri Narasimhan. Learning cache replacement with CACHEUS. In 19th USENIX Conference on File and Storage Technologies (FAST 21), pages 341–354, 2021

  30. [30] Dhruv Rohatgi. Near-optimal bounds for online caching with machine learned advice. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1834–1845. SIAM, 2020

  31. [31] Karim Ahmed Abdel Sadek and Marek Elias. Algorithms for caching and MTS with reduced number of predictions. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=QuIiLSktO4

  32. [32] Zhan Shi, Xiangru Huang, Akanksha Jain, and Calvin Lin. Applying deep learning to the cache replacement problem. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 413–425, 2019

  33. [33] Xiaoniu Song, Yiwen Zhang, Rong Chen, and Haibo Chen. Ugache: A unified gpu cache for embedding-based deep learning. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 627–641, 2023

  34. [34] Zhenyu Song, Daniel S Berger, Kai Li, Anees Shaikh, Wyatt Lloyd, Soudeh Ghorbani, Changhoon Kim, Aditya Akella, Arvind Krishnamurthy, Emmett Witchel, et al. Learning relaxed belady for content distribution network caching. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20), pages 529–544, 2020

  35. [35] Zhenyu Song, Kevin Chen, Nikhil Sarda, Deniz Altınbüken, Eugene Brevdo, Jimmy Coleman, Xiao Ju, Pawel Jurczyk, Richard Schooler, and Ramki Gummadi. HALP: Heuristic aided learned preference eviction policy for YouTube content delivery network. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pages 1149–1163, 2023

  36. [36] Qwen Team. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024

  37. [37] The AIBrix Team, Jiaxin Shan, Varun Gupta, Le Xu, Haiyang Shi, Jingyuan Zhang, Ning Wang, Linhui Xu, Rong Kang, Tongping Liu, et al. Aibrix: Towards scalable, cost-effective large language model inference infrastructure. arXiv preprint arXiv:2504.03648, 2025

  38. [38] Giuseppe Vietri, Liana V Rodriguez, Wendy A Martinez, Steven Lyons, Jason Liu, Raju Rangaswami, Ming Zhao, and Giri Narasimhan. Driving cache replacement with ML-based LeCaR. In 10th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 18), 2018

  39. [39] Jiahao Wang, Jinbo Han, Xingda Wei, Sijie Shen, Dingyan Zhang, Chenguang Fang, Rong Chen, Wenyuan Yu, and Haibo Chen. Kvcache cache in the wild: Characterizing and optimizing kvcache cache at a large cloud provider. In 2025 USENIX Annual Technical Conference (USENIX ATC 25). USENIX Association, July

  40. [40] URL https://www.usenix.org/conference/atc25/presentation/wang-jiahao

  41. [41]

    Abel, Xu Guo, Jianbing Dong, Ji Shi, and Kunlun Li

    Zehuan Wang, Yingcan Wei, Minseok Lee, Matthias Langer, Fan Yu, Jie Liu, Shijie Liu, Daniel G. Abel, Xu Guo, Jianbing Dong, Ji Shi, and Kunlun Li. Merlin hugectr: Gpu-accelerated recommender system training and inference. InProceedings of the 16th ACM Con- ference on Recommender Systems, RecSys ’22, page 534–537, New York, NY , USA, 2022. Association for ...

  42. [42]

    Better and simpler learning-augmented online caching.arXiv preprint arXiv:2005.13716, 2020

    Alexander Wei. Better and simpler learning-augmented online caching.arXiv preprint arXiv:2005.13716, 2020

  43. [43]

    A gpu-specialized inference parameter server for large-scale deep recom- mendation models

    Yingcan Wei, Matthias Langer, Fan Yu, Minseok Lee, Jie Liu, Ji Shi, and Zehuan Wang. A gpu-specialized inference parameter server for large-scale deep recom- mendation models. InProceedings of the 16th ACM Conference on Recommender Systems, pages 408–419, 2022

  44. [44]

    Fleche: an efficient gpu embedding cache for personalized recommendations

    Minhui Xie, Youyou Lu, Jiazhen Lin, Qing Wang, Jian Gao, Kai Ren, and Jiwu Shu. Fleche: an efficient gpu embedding cache for personalized recommendations. In Proceedings of the Seventeenth European Conference on Computer Systems, pages 402–416, 2022

  45. [45]

    Learning from op- timal caching for content delivery

    Gang Yan, Jian Li, and Don Towsley. Learning from op- timal caching for content delivery. InProceedings of the 17th International Conference on emerging Networking EXperiments and Technologies, pages 344–358, 2021

  46. [46]

    A learned cache eviction framework with mini- mal overhead.arXiv preprint arXiv:2301.11886, 2023

    Dongsheng Yang, Daniel S Berger, Kai Li, and Wyatt Lloyd. A learned cache eviction framework with mini- mal overhead.arXiv preprint arXiv:2301.11886, 2023

  47. [47]

    {GL-Cache}: Group-level learning for efficient and high-performance caching

    Juncheng Yang, Ziming Mao, Yao Yue, and KV Rashmi. {GL-Cache}: Group-level learning for efficient and high-performance caching. In21st USENIX Confer- ence on File and Storage Technologies (FAST 23), pages 115–134, 2023

  48. [48]

    GPU-Disaggregated serving for deep learning recommendation models at scale

    Lingyun Yang, Yongchen Wang, Yinghao Yu, Qizhen Weng, Jianbo Dong, Kan Liu, Chi Zhang, Yanyi Zi, Hao Li, Zechao Zhang, et al. GPU-Disaggregated serving for deep learning recommendation models at scale. In 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25), pages 847–863, 2025

  49. [49]

    Tenrec: A large-scale multipurpose benchmark dataset for recommender systems. Advances in Neural Information Processing Systems, 35:11480–11493, 2022

    Guanghu Yuan, Fajie Yuan, Yudong Li, Beibei Kong, Shujie Li, Lei Chen, Min Yang, Chenyun Yu, Bo Hu, Zang Li, et al. Tenrec: A large-scale multipurpose benchmark dataset for recommender systems. Advances in Neural Information Processing Systems, 35:11480–11493, 2022

  50. [50]

    SIEVE is simpler than LRU: an efficient Turn-Key eviction algorithm for web caches

    Yazhuo Zhang, Juncheng Yang, Yao Yue, Ymir Vigfusson, and KV Rashmi. SIEVE is simpler than LRU: an efficient Turn-Key eviction algorithm for web caches. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 1229–1246, 2024

  51. [51]

    Two-level graph caching for expediting distributed gnn training

    Zhe Zhang, Ziyue Luo, and Chuan Wu. Two-level graph caching for expediting distributed gnn training. In IEEE INFOCOM 2023-IEEE Conference on Computer Communications, pages 1–10. IEEE, 2023

  52. [52]

    Song: Approximate nearest neighbor search on gpu

    Weijie Zhao, Shulong Tan, and Ping Li. Song: Approximate nearest neighbor search on gpu. In 2020 IEEE 36th International Conference on Data Engineering (ICDE), pages 1033–1044. IEEE, 2020

  53. [53]

    Sglang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems, 37:62557–62583, 2024

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems, 37:62557–62583, 2024

  54. [54]

    Deep interest evolution network for click-through rate prediction

    Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. Deep interest evolution network for click-through rate prediction. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 5941–5948, 2019

  55. [55]

    3L-Cache: Low overhead and precise learning-based eviction policy for caches

    Wenbin Zhou, Zhixiong Niu, Yongqiang Xiong, Juan Fang, and Qian Wang. 3L-Cache: Low overhead and precise learning-based eviction policy for caches. In 23rd USENIX Conference on File and Storage Technologies (FAST 25), pages 237–254, 2025.

    Artifact Appendix

    Abstract

    The provided artifacts include all code and non-confidential traces to reproduce the...

  56. [56]

    The benchmark supports direct comparison of GPU embedding cache policies, including FPB (following predictions blindly), LARU, and classical LRU

    SLS-Cache-Bench. For DLRMs, we develop a dedicated benchmark that isolates the SparseLengthsSum (SLS) operation from the HugeCTR framework and integrates LCR. The benchmark supports direct comparison of GPU embedding cache policies, including FPB (following predictions blindly), LARU, and classical LRU. We implement an alternative GPU cache in the fl...

  57. [57]

    This integration enables systematic comparison of GPU KV cache policies, including FPB, LARU, and classical LRU

    LCR-on-SGLang. For LLMs, we integrate LCR into SGLang (version 0.4.9.post2) by adding LightGBM support, implementing an online training framework, and replacing the default RadixTree cache (radix_cache.py) with LARU. This integration enables systematic comparison of GPU KV cache policies, including FPB, LARU, and classical LRU. It comprises over 5K l...

  58. [58]

    DLRMs: CUDA 12.4, PyTorch 2.6.0, and either devtoolset-9 or gcc-9

  59. [59]

    The hardware used to generate the results in this paper is described in Section 4.1 for DLRMs and Section 5.1 for LLMs

    LLMs: CUDA 12.6, PyTorch 2.6.0, and the required third-party Python packages, lightgbm. The hardware used to generate the results in this paper is described in Section 4.1 for DLRMs and Section 5.1 for LLMs. Nevertheless, the experiments can also be conducted on other GPUs, provided that the available HBM capacity is sufficient to accommodate both the ML ...

  60. [60]

    Bounded robustness. The robustness of LARU is O(k), as proved in Technical Appendix D. In other words, regardless of predictor accuracy, its competitive ratio is bounded by O(k), matching LRU’s k-competitive ratio asymptotically and safeguarding against catastrophic performance degradation or systemic risk

  61. [61]

    Under perfect predictions, LARU matches the offline optimal algorithm, fully leveraging accurate machine-learned predictions

    Ideal consistency. The consistency of LARU is 1, as proved in Technical Appendix E. Under perfect predictions, LARU matches the offline optimal algorithm, fully leveraging accurate machine-learned predictions

  62. [62]

    In DLRMs, embedding caches are partitioned into buckets (e.g., SlabHash in HugeCTR), each holding only tens of items (e.g., 64)

    Low time complexity. The algorithm runs in O(log k) amortized time per request, which is affordable because k is small in practice. In DLRMs, embedding caches are partitioned into buckets (e.g., SlabHash in HugeCTR), each holding only tens of items (e.g., 64). In LLMs, KV cache vectors are typically organized via RadixTrees in SGLang or paged-attention in vLLM,...

  63. [63]

    Reduced predictor usage. In asynchronous mode, prediction tasks are triggered only under specific conditions, such as at most once per fixed interval or after the previous task has completed, preventing excessive usage. In synchronous mode, once a prediction-induced cache miss is detected (Line 14 in Algorithm 1), LARU falls back to LRU once, limiting ...
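The asynchronous trigger conditions described above can be sketched as a small gate. This is a minimal illustration and not the authors' implementation: the class `PredictionThrottle`, its fields, and the choice to require both conditions at once (interval elapsed and no task in flight) are assumptions made for the example, since the text lists the conditions only as examples.

```python
from dataclasses import dataclass


@dataclass
class PredictionThrottle:
    """Hypothetical gate that limits how often prediction tasks start."""

    min_interval_s: float                  # assumed fixed interval between launches
    last_launch_s: float = float("-inf")   # time of the most recent launch
    in_flight: bool = False                # True while a prediction task is running

    def try_launch(self, now_s: float) -> bool:
        """Return True (and record the launch) if a task may start at now_s."""
        if self.in_flight:
            return False  # previous task has not completed yet
        if now_s - self.last_launch_s < self.min_interval_s:
            return False  # fixed interval has not elapsed yet
        self.in_flight = True
        self.last_launch_s = now_s
        return True

    def mark_done(self) -> None:
        """Record that the running prediction task has completed."""
        self.in_flight = False
```

A caller would invoke `try_launch` with the current time on each request and start a prediction task only when it returns `True`, calling `mark_done` when that task finishes.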

  64. [64]

    the requested item is a new item that has not yet been evicted in the current phase

  65. [65]

    The number of occurrences of case (1) equals the number of distinct new items, $c_i$

    the requested item was previously evicted by LRU’s policy within the same phase. The number of occurrences of case (1) equals the number of distinct new items, $c_i$. Note that after $\log_b(k)$ occurrences of case (2), λ has become $1/b^{\log_b(k)} = 1/k$, the candidate set L shrinks to size 1, and the eviction step at Line 25 degenerates to LRU’s policy. Hence, the n...
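The shrinking of the candidate set can be checked with a small worked example. This sketch assumes that λ starts at 1, that it is divided by b on each occurrence of case (2), and that the candidate set L holds about λ·k items. The function name `candidate_set_sizes` is hypothetical, and integer division stands in for whatever rounding Algorithm 1 actually uses.

```python
def candidate_set_sizes(k: int, b: int) -> list[int]:
    """List |L| after 0, 1, 2, ... occurrences of case (2), down to size 1.

    Assumes |L| = lambda * k with lambda starting at 1 and divided by b
    each time; integer arithmetic avoids floating-point rounding.
    """
    if k < 1 or b < 2:
        raise ValueError("need k >= 1 and b >= 2")
    sizes = [k]  # lambda = 1, so the candidate set spans the whole cache
    while sizes[-1] > 1:
        sizes.append(max(1, sizes[-1] // b))  # lambda <- lambda / b
    return sizes


# Example: a 64-item bucket with shrink factor b = 2.
sizes = candidate_set_sizes(64, 2)
shrink_steps = len(sizes) - 1  # occurrences of case (2) until |L| = 1
```

With k = 64 and b = 2, the sizes are 64, 32, 16, 8, 4, 2, 1, so the set reaches size 1 after 6 = log_2(64) shrink steps. Under these assumptions, the eviction step coincides with LRU's policy from that point on.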