Toward Robust and Efficient ML-Based GPU Caching for Modern Inference

arxiv: 2509.20979 · v2 · submitted 2025-09-25 · 💻 cs.LG

Peng Chen, Jiaji Zhang, Hailiang Zhao, Yirong Zhang, Shenyao Chen, Jiahong Yu, Xueyan Tang, Yixuan Wang, Hao Li, Jianping Zou, Gang Xiong, Kingsum Chow, Shuibing He, Shuiguang Deng

Pith reviewed 2026-05-18 14:24 UTC · model grok-4.3

classification 💻 cs.LG

keywords: learning-augmented caching, GPU cache, LRU, LLM inference, DLRM, consistency, robustness, ML-based caching

The pith

Learning-augmented LRU guarantees 1-consistency and O(k)-robustness for GPU inference caches with low overhead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a caching algorithm that merges machine learning predictions with the classic LRU policy for use in GPU-based inference systems. The goal is to get the benefits of learned predictions for higher hit rates while ensuring the system does not fall too far behind standard LRU if the predictions turn out wrong. By proving 1-consistency and O(k)-robustness and keeping computational costs low, the method aims to make learning-based caching reliable enough for real deployment in large language model and recommendation model serving.

Core claim

We propose learning-augmented LRU, a deployment-oriented learning-augmented caching algorithm that guarantees 1-consistency and O(k)-robustness, incurs low time and space overhead, and maintains strong compatibility. We further build a GPU cache, called LCR, on top of learning-augmented LRU to benefit from its theoretical guarantees and translate them into practical performance. In experiments, LCR reduces P99 time-to-first-token by up to 28.3% on LLM workloads and increases throughput by up to 24.2% on DLRM workloads, with graceful degradation under poor predictions.

What carries the argument

Learning-augmented LRU, which integrates predictions into LRU eviction to achieve consistency and robustness guarantees while remaining efficient.
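To make the idea of "integrating predictions into LRU eviction" concrete, here is a minimal toy sketch. It is not the paper's learning-augmented LRU and carries none of its guarantees: the class name `PredictionGuidedLRU`, the `predict_next_use` hook, and the `make_oracle` helper are all hypothetical names introduced for illustration. The toy evicts the item with the farthest predicted next use and falls back to plain LRU order when no prediction is available.

```python
from collections import OrderedDict
from typing import Callable, Hashable, Optional


class PredictionGuidedLRU:
    """Toy cache: evict by predicted next use, fall back to LRU order."""

    def __init__(self, capacity: int,
                 predict_next_use: Callable[[Hashable, int], Optional[float]]):
        self.capacity = capacity
        self.predict_next_use = predict_next_use  # hypothetical predictor hook
        # Insertion order doubles as recency order (oldest first).
        self.items: "OrderedDict[Hashable, Optional[float]]" = OrderedDict()
        self.clock = 0
        self.hits = 0
        self.misses = 0

    def access(self, key: Hashable) -> bool:
        self.clock += 1
        hit = key in self.items
        if hit:
            self.hits += 1
            self.items.move_to_end(key)
        else:
            self.misses += 1
            if len(self.items) >= self.capacity:
                self._evict()
        # Refresh the stored prediction on every access.
        self.items[key] = self.predict_next_use(key, self.clock)
        return hit

    def _evict(self) -> None:
        predicted = [(t, k) for k, t in self.items.items() if t is not None]
        if predicted:
            # Farthest predicted next use; ties go to the least recent item.
            victim = max(predicted, key=lambda pair: pair[0])[1]
        else:
            victim = next(iter(self.items))  # plain LRU fallback
        del self.items[victim]


def make_oracle(trace):
    """Perfect predictor for a known trace (illustration only)."""
    def oracle(key, clock):
        for pos in range(clock, len(trace)):
            if trace[pos] == key:
                return float(pos + 1)
        return float("inf")
    return oracle


trace = ["a", "b", "c", "a", "d", "b"]
lru_like = PredictionGuidedLRU(3, lambda key, clock: None)
guided = PredictionGuidedLRU(3, make_oracle(trace))
for key in trace:
    lru_like.access(key)
    guided.access(key)
print(lru_like.hits, guided.hits)
```

With a predictor that always returns `None` the toy behaves exactly like LRU, which is the simplest possible form of the fallback a robust scheme needs; how the paper blends the two signals is not described on this page.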

If this is right

Cache performance approaches the offline optimum when predictions are accurate.
Under arbitrary prediction errors, worst-case cost stays within an O(k) factor of the offline optimum, so performance cannot degrade without bound.
Low time and space overhead allows deployment in existing GPU inference pipelines.
The same guarantees apply across both LLM and DLRM inference workloads.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Production inference systems could adopt this as a safer way to incorporate ML predictors into caching.
The robustness bound may allow the policy to replace LRU without additional monitoring of predictor quality.
The integration technique could be tested on other base policies beyond LRU.

Load-bearing premise

The theoretical consistency and robustness properties can be realized in a practical implementation that meets strict time and space efficiency constraints on real GPU hardware.

What would settle it

Measure LCR under deliberately inaccurate predictions on GPU hardware and check whether its miss cost stays within the stated O(k) factor of the offline optimum, and close to plain LRU as the abstract claims, or violates the bound.
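A test of this kind can be prototyped off-GPU before touching real hardware. The sketch below is a hedged illustration, not the paper's evaluation code: `corrupt`, `blind_misses`, and `miss_ratio_vs_lru` are hypothetical names, the negation-based noise is only a crude stand-in for "worst-case" noise, and the blind policy is a placeholder for whatever policy is under test. It reports misses relative to LRU as a practical proxy; the formal robustness bound is stated against the offline optimum.

```python
import random
from collections import OrderedDict


def true_next_use(trace):
    """next_use[i] is the index of the next access to trace[i], or inf."""
    next_use, last_seen = [float("inf")] * len(trace), {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i
    return next_use


def corrupt(preds, p, rng):
    """Negate each prediction with probability p (crude adversarial noise).

    A negated value is smaller than every clean one, so a corrupted item
    looks more urgent than any clean item, and an item that is never
    reused (inf -> -inf) looks the most urgent of all.
    """
    return [-x if rng.random() < p else x for x in preds]


def lru_misses(trace, capacity):
    cache, misses = OrderedDict(), 0
    for key in trace:
        if key in cache:
            cache.move_to_end(key)
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)
            cache[key] = None
    return misses


def blind_misses(trace, capacity, preds):
    """Stand-in policy: always evict the largest predicted next use."""
    cache, misses = {}, 0  # key -> predicted next use
    for i, key in enumerate(trace):
        if key not in cache:
            misses += 1
            if len(cache) >= capacity:
                del cache[max(cache, key=cache.get)]
        cache[key] = preds[i]
    return misses


def miss_ratio_vs_lru(trace, capacity, policy, probs, seed=0):
    """Misses of `policy` divided by LRU misses, per noise probability."""
    rng = random.Random(seed)
    base = lru_misses(trace, capacity)
    clean = true_next_use(trace)
    return {p: policy(trace, capacity, corrupt(clean, p, rng)) / base
            for p in probs}


rng = random.Random(1)
trace = [rng.randrange(50) for _ in range(2000)]
ratios = miss_ratio_vs_lru(trace, 10, blind_misses, [0.0, 0.1, 0.5, 1.0])
for p, r in ratios.items():
    print(f"noise p={p:.1f}: misses / LRU misses = {r:.2f}")
```

With clean predictions the blind policy coincides with Belady's offline rule, so its ratio cannot exceed 1. Swapping in a robust policy for `blind_misses` and watching how the ratio moves as `p` grows is the off-GPU analogue of the experiment described above.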

Figures

Figures reproduced from arXiv: 2509.20979 by Gang Xiong, Hailiang Zhao, Hao Li, Jiahong Yu, Jiaji Zhang, Jianping Zou, Kingsum Chow, Peng Chen, Shenyao Chen, Shuibing He, Shuiguang Deng, Xueyan Tang, Yirong Zhang, Yixuan Wang.

**Figure 2.** Key performance metrics differ between DLRMs and LLMs. (1) DLRMs process requests in tens of milliseconds, enabling high concurrency. System cost is dominated by throughput, making average latency the key optimization target. (2) LLMs involve longer sequences and heavier computation, resulting in lower concurrency. User experience depends heavily on tail latency, particularly P95/P99 TTFT, which is stron…

**Figure 3.** Illustration of two GPU caching use cases. (1) DLRMs: Inputs contain dense features and sparse features (e.g., user ID, product category). Dense features are processed by a bottom MLP, while sparse features are mapped into embedding vectors via large pre-trained tables, often stored hierarchically in GPU HBM and DRAM. (2) LLMs: Each request prompt is a token sequence whose KV vectors are computed during th…

**Figure 4.** Cache hit rate gap between LRU and the offline optimum (OPT) under varying cache ratios. Conventional cache replacement policies struggle under dynamic inference workloads, motivating the use of ML to adapt to evolving access patterns.

2.2 Existing learning-based caching systems. To address the limitations of heuristic policies, recent work has explored learning-based caching [12, 22, 29, 32, 34, 35, 38, 4…
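The LRU-versus-OPT gap that Figure 4 plots can be reproduced in miniature on any access trace. The sketch below is an assumption-laden illustration rather than the paper's measurement code: the function names are hypothetical, the trace is synthetic, and the cache ratio follows the definition given in the Figure 9 caption (cache size over total distinct items). OPT is computed with Belady's rule, which evicts the cached item whose next access lies farthest in the future.

```python
import random
from collections import OrderedDict


def lru_hits(trace, capacity):
    cache, hits = OrderedDict(), 0
    for key in trace:
        if key in cache:
            hits += 1
            cache.move_to_end(key)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # drop least recently used
            cache[key] = None
    return hits


def opt_hits(trace, capacity):
    """Offline optimum via Belady's rule (needs the whole trace up front)."""
    n = len(trace)
    next_use, last_seen = [n] * n, {}
    for i in range(n - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], n)
        last_seen[trace[i]] = i
    cache, hits = {}, 0  # key -> index of its next access
    for i, key in enumerate(trace):
        if key in cache:
            hits += 1
        elif len(cache) >= capacity:
            del cache[max(cache, key=cache.get)]
        cache[key] = next_use[i]
    return hits


def hit_rate_gap(trace, cache_ratio):
    """OPT hit rate minus LRU hit rate at the given cache ratio."""
    capacity = max(1, round(cache_ratio * len(set(trace))))
    gap = opt_hits(trace, capacity) - lru_hits(trace, capacity)
    return gap / len(trace)


# Synthetic skewed trace (illustrative only, not a workload from the paper).
rng = random.Random(0)
trace = [int(rng.paretovariate(1.2)) % 5000 for _ in range(20000)]
for ratio in (0.01, 0.02, 0.03):
    print(f"cache ratio {ratio:.0%}: OPT - LRU hit rate = "
          f"{hit_rate_gap(trace, ratio):.3f}")
```

The gap is never negative because no online policy can beat the offline optimum on misses; how large it is depends entirely on the trace.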

**Figure 5.** Cost of caching algorithms with different types of…

**Figure 6.** Comparison of performance guarantees among rep…

**Figure 7.** LCR’s architecture and workflows.

**Figure 8.** LARU’s algorithmic process on a cache miss.

**Figure 9.** Average SLS latency improvement across different cache ratios and batch sizes. FPB and LARU employ the same LightGBM predictor. Numbers in boxes denote LARU’s improvement over LRU. Results are reported for (a) AD-CTR-User and (b) QB-video datasets, evaluated at cache ratios of 1%, 2%, and 3%. The cache ratio is cache size over total distinct items.

Predictor settings. Due to the high concurrency of DLRM wo…

**Figure 10.** Robustness Verification. Average SLS latency under different batch sizes and cache ratios (1%-3%) on (a) AD-CTR-User and (b) QB-video datasets. Both FPB and LARU use the same low-accuracy LightGBM predictor to assess robustness.

… improves SLS throughput, particularly at the large batch sizes common in production environments. Robustness Verification. We empirically verify the robustness of LARU using an i…

**Figure 11.** P99 TTFT improvement of Qwen2.5-32B on the…

**Figure 12.** P99 TTFT improvement of DeepSeek-R1-671B…

**Figure 13.** Robustness Verification. KV cache hit rate under injected noise on Aibrix-Synthetic (top) and Online-QA (bottom). The probability of replacing predictions with worst-case noise is varied from 0.1 to 1.0.

… overly conservatively, without fully benefiting from accurate predictions. In contrast, LARU degrades gracefully, consistently maintaining higher hit rates than both FPB and LRU across all error levels. …

Original abstract

In modern GPU inference, cache efficiency remains a major bottleneck, and heuristic policies such as LRU can perform far worse than the offline optimum. Existing learning-based caching systems improve hit rates mainly through predictor design, but often follow learned predictions blindly, making performance unreliable when predictions are inaccurate. In contrast, emerging learning-augmented caching algorithms [23, 24] provide performance guarantees by carefully integrating predictions into caching policies, achieving both consistency (near-optimality under perfect predictions) and robustness (bounded worst-case performance under prediction errors). However, deployment remains challenging. A practical algorithm should satisfy strict time and space efficiency constraints, which some theoretical work overlooks, while also incurring low deployment overhead. We propose learning-augmented LRU, a deployment-oriented learning-augmented caching algorithm that guarantees 1-consistency and O(k)-robustness, incurs low time and space overhead, and maintains strong compatibility. We further build a GPU cache, called LCR, on top of learning-augmented LRU to benefit from its theoretical guarantees and translate them into practical performance. In experiments, LCR reduces P99 time-to-first-token (TTFT) by up to 28.3% on LLM workloads and increases throughput by up to 24.2% on deep learning recommendation (DLRM) workloads. Even with poor predictions, performance degrades gracefully and remains close to LRU, demonstrating robustness with practical value.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper adapts learning-augmented LRU to GPU inference with stated 1-consistency and O(k)-robustness plus measured speedups, but the low-overhead claim that makes the guarantees usable in practice is not strongly anchored.

The punchline on this paper is that it adapts learning-augmented caching algorithms to GPU inference by proposing a deployment-oriented learning-augmented LRU with 1-consistency and O(k)-robustness guarantees, then demonstrates practical gains through the LCR cache implementation. It does a good job highlighting the gap in prior work where theoretical algorithms overlook efficiency constraints for real hardware. The experiments on LLM workloads showing up to 28.3% reduction in P99 TTFT and on DLRM with 24.2% throughput improvement, along with graceful degradation under poor predictions, provide concrete evidence that the approach can work in practice. This makes the robustness property more than just theoretical.

The main soft spot is whether the low time and space overhead is actually achieved in the implementation. The theoretical properties only hold if the added machinery for predictions and scoring doesn't exceed the tight latency budgets of GPU inference. Without detailed micro-benchmarks comparing per-request costs to standard LRU, it's hard to be sure the robustness carries over to the measured results. The stress-test concern about auxiliary state potentially breaking the bounds seems relevant here unless the paper addresses it directly with numbers.

This paper is for systems researchers focused on ML inference serving and caching optimizations. Readers who want to see learning-augmented methods applied to constrained environments with both guarantees and experiments will find it worthwhile. It has enough substance to deserve a serious referee, as the application is timely and the claims can be evaluated. I recommend sending it for peer review rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The paper proposes learning-augmented LRU, a caching algorithm that integrates predictions to achieve 1-consistency and O(k)-robustness while incurring low time and space overhead and maintaining compatibility with existing systems. It then builds the LCR GPU cache on this policy and reports empirical gains of up to 28.3% reduction in P99 TTFT on LLM workloads and 24.2% throughput improvement on DLRM workloads, with graceful degradation under inaccurate predictions.

Significance. If the claimed consistency and robustness properties are preserved under the low-overhead implementation and the experiments adequately control for prediction quality and workload selection, the work would meaningfully advance the practical deployment of learning-augmented algorithms in GPU inference systems, closing a noted gap between theoretical guarantees and real hardware constraints.

major comments (2)

[Experimental Evaluation] The central claim requires that the O(k)-robustness bound survives the practical implementation; however, the manuscript provides no micro-benchmark or per-request latency breakdown isolating the auxiliary predictor state and eviction-score recomputation costs relative to plain LRU (see the experimental setup and LCR implementation sections). Without this, it is impossible to confirm that the added overhead remains within the strict per-operation budget of modern GPU inference kernels.
[Learning-Augmented LRU] The abstract asserts 1-consistency and O(k)-robustness for learning-augmented LRU, yet the manuscript contains no derivation steps, pseudocode, or reduction showing how the low-overhead modifications preserve these properties from the cited prior literature (pmlr-v80-lykouris18a, mitzenmacher2022algorithms). This derivation is load-bearing for the deployment-oriented positioning.
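The per-request breakdown that the first major comment asks for could start from a harness like the one below. This is a host-side Python sketch under loud assumptions: `scored_access` is a hypothetical stand-in for "LRU plus auxiliary predictor state", the built-in `hash` plays the role of a predictor call, and nothing here touches a GPU. It only shows the shape of the comparison (same trace, same capacity, best-of-N timing per operation); a real measurement would instrument the actual cache on the inference path.

```python
import time
from collections import OrderedDict


def lru_access(cache, capacity, key):
    """Plain LRU bookkeeping for one request."""
    if key in cache:
        cache.move_to_end(key)
        return True
    if len(cache) >= capacity:
        cache.popitem(last=False)
    cache[key] = None
    return False


def scored_access(cache, capacity, key, score):
    """LRU plus one stand-in predictor call whose result is stored per key."""
    hit = lru_access(cache, capacity, key)
    cache[key] = score(key)  # auxiliary state kept alongside the entry
    return hit


def ns_per_op(access, trace, repeats=5):
    """Best-of-`repeats` average nanoseconds per access over the trace."""
    best = float("inf")
    for _ in range(repeats):
        cache = OrderedDict()
        start = time.perf_counter_ns()
        for key in trace:
            access(cache, key)
        elapsed = time.perf_counter_ns() - start
        best = min(best, elapsed / len(trace))
    return best


trace = [(i * 7919) % 1000 for i in range(20_000)]
base = ns_per_op(lambda c, k: lru_access(c, 256, k), trace)
aug = ns_per_op(lambda c, k: scored_access(c, 256, k, hash), trace)
print(f"LRU {base:.0f} ns/op, augmented {aug:.0f} ns/op, "
      f"overhead {aug / base - 1:.1%}")
```

Taking the minimum over repeats reduces scheduling noise; reporting the distribution as well would speak to the tail-latency concern raised elsewhere in the report.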

minor comments (2)

[Theoretical Analysis] Notation for the robustness parameter k is introduced without an explicit definition or relation to cache size in the theoretical section.
[Experiments] Figure captions for the TTFT and throughput plots should include error bars or the number of runs to clarify statistical significance of the reported 28.3% and 24.2% gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address the major comments point by point below, and we are prepared to incorporate revisions to strengthen the manuscript.

Referee: [Experimental Evaluation] The central claim requires that the O(k)-robustness bound survives the practical implementation; however, the manuscript provides no micro-benchmark or per-request latency breakdown isolating the auxiliary predictor state and eviction-score recomputation costs relative to plain LRU (see the experimental setup and LCR implementation sections). Without this, it is impossible to confirm that the added overhead remains within the strict per-operation budget of modern GPU inference kernels.

Authors: We agree that providing a micro-benchmark or per-request latency breakdown to isolate the costs of the auxiliary predictor state and eviction-score recomputation would help confirm that the overhead is low enough for modern GPU inference kernels. Our current experiments focus on end-to-end performance metrics, which show improvements without apparent overhead penalties. To address this, we will add a new subsection in the experimental evaluation with micro-benchmarks comparing the per-operation costs to plain LRU.
revision: yes

Referee: [Learning-Augmented LRU] The abstract asserts 1-consistency and O(k)-robustness for learning-augmented LRU, yet the manuscript contains no derivation steps, pseudocode, or reduction showing how the low-overhead modifications preserve these properties from the cited prior literature (pmlr-v80-lykouris18a, mitzenmacher2022algorithms). This derivation is load-bearing for the deployment-oriented positioning.

Authors: We appreciate this observation. The low-overhead modifications to the learning-augmented LRU are intended to preserve the 1-consistency and O(k)-robustness guarantees from the referenced works by maintaining the core integration of predictions while optimizing for efficiency. However, we acknowledge that the manuscript would benefit from explicit derivation steps and pseudocode. We will add a new section or appendix that provides the derivation, including how the modifications reduce to the original bounds, along with pseudocode for the algorithm.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; guarantees rest on external literature

The paper's core proposal of learning-augmented LRU with 1-consistency and O(k)-robustness is explicitly positioned as building on cited prior work (Lykouris and Vassilvitskii 2018; Mitzenmacher and Vassilvitskii 2022) rather than deriving those properties from quantities fitted or defined inside this manuscript. No equations, self-citations, or ansatzes are shown reducing the claimed guarantees or experimental gains (28.3% TTFT, 24.2% throughput) to tautological inputs. The contribution centers on practical GPU deployment and compatibility, which is assessed independently through experiments, so the argument does not loop back on its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The proposal rests on standard definitions of consistency and robustness from the cited learning-augmented algorithms literature and on the assumption that ML predictors can be queried with negligible overhead; no free parameters or new invented entities are introduced in the abstract.

axioms (1)

domain assumption: Standard definitions of 1-consistency and O(k)-robustness from the learning-augmented caching literature.
The guarantees are defined with respect to these prior notions as referenced in the abstract.
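For reference, the notions this axiom points to are usually written as below. This is the common formulation from the learning-augmented algorithms literature, not a transcription of the paper's own statement: the symbols ALG, OPT, σ, p*, p̂ and the additive constant c are notation chosen here for illustration.

```latex
% ALG(\sigma, p): misses of the algorithm on request sequence \sigma given predictions p
% OPT(\sigma):    misses of the offline optimum on \sigma
% c:              an additive constant independent of \sigma
\begin{align*}
\text{$\alpha$-consistency:}\quad
  & \mathrm{ALG}(\sigma, p^{*}) \le \alpha \cdot \mathrm{OPT}(\sigma) + c
  && \text{when the predictions } p^{*} \text{ are perfect},\\
\text{$\beta$-robustness:}\quad
  & \mathrm{ALG}(\sigma, \hat{p}) \le \beta \cdot \mathrm{OPT}(\sigma) + c
  && \text{for arbitrary predictions } \hat{p}.
\end{align*}
```

The paper's claim corresponds to α = 1 and β = O(k). In the competitive-paging literature k conventionally denotes the cache size, and plain LRU is k-competitive, so an O(k) robustness bound matches LRU's worst-case order up to a constant. Whether the paper defines k this way is exactly what the referee's first minor comment asks the authors to state.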

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving
cs.DC 2026-05 unverdicted novelty 6.0

HELM adaptively partitions HBM between EMB and KV caches via a three-layer PPO controller and EMB-KV-aware scheduling, reducing P99 latency by 24-38% while achieving 93.5-99.6% SLO satisfaction on production workloads.

SCION: Size-aware Policy Orchestration for Nonstationary Object Caches (Long Paper Version)
cs.DC 2026-03 conditional novelty 5.0

SCION is a lightweight orchestration layer that picks among six deployable cache policies via an offline-trained linear selector on short-prefix size and reuse fingerprints, improving cacheable miss ratio over SIEVE o...

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 2 Pith papers · 2 internal anchors

[1] Sharegpt raw. URL https://huggingface.co/datasets/philschmid/sharegpt-raw/tree/main/sharegpt_90k_raw_dataset

[2] The 2nd cache replacement championship, 2017. URL https://crc2.ece.tamu.edu/

[3] Saman Ashkiani, Martin Farach-Colton, and John D Owens. A dynamic hash table for the gpu. In 2018 IEEE international parallel and distributed processing symposium (IPDPS), pages 419–429. IEEE, 2018.

[4] Nathan Beckmann, Haoxian Chen, and Asaf Cidon. LHD: Improving cache hit rate by maximizing hit density. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), pages 389–403, 2018.

[5] Allan Borodin and Ran El-Yaniv. Online computation and competitive analysis, 1998.

[6] Sitian Chen, Haobin Tan, Amelie Chi Zhou, Yusen Li, and Pavan Balaji. Updlrm: Accelerating personalized recommendation using real-world pim architecture. In Proceedings of the 61st ACM/IEEE Design Automation Conference, pages 1–6, 2024.
[7] Feng Cheng, Cong Guo, Chiyue Wei, Junyao Zhang, Changchun Zhou, Edward Hanson, Jiaqi Zhang, Xiaoxiao Liu, Hai Li, and Yiran Chen. Ecco: Improving memory bandwidth and capacity for llms via entropy-aware cache compression. In Proceedings of the 52nd Annual International Symposium on Computer Architecture, pages 793–807, 2025.

[8] Assaf Eisenman, Maxim Naumov, Darryl Gardner, Misha Smelyanskiy, Sergey Pupyrev, Kim Hazelwood, Asaf Cidon, and Sachin Katti. Bandana: Using non-volatile memory for storing deep learning models. Proceedings of machine learning and systems, 1:40–52, 2019.

[9] Amos Fiat, Richard M Karp, Michael Luby, Lyle A McGeoch, Daniel D Sleator, and Neal E Young. Competitive paging algorithms. Journal of Algorithms, 12(4):685–699, 1991.

[10] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

[11] Xinyue Hu, Eman Ramadan, Wei Ye, Feng Tian, and Zhi-Li Zhang. Raven: belady-guided, predictive (deep) learning for in-memory and content caching. In Proceedings of the 18th International Conference on emerging Networking EXperiments and Technologies, pages 72–90, 2022.

[12] Akanksha Jain and Calvin Lin. Back to the future: Leveraging belady’s algorithm for improved cache replacement. ACM SIGARCH Computer Architecture News, 44(3):78–89, 2016.
[13] Wenqi Jiang, Zhenhao He, Shuai Zhang, Thomas B Preußer, Kai Zeng, Liang Feng, Jiansong Zhang, Tongxuan Liu, Yong Li, Jingren Zhou, et al. Microrec: Efficient recommendation inference by hardware and data structure solutions. Proceedings of Machine Learning and Systems, 3:845–859, 2021.

[14] Wenqi Jiang, Zhenhao He, Shuai Zhang, Kai Zeng, Liang Feng, Jiansong Zhang, Tongxuan Liu, Yong Li, Jingren Zhou, Ce Zhang, et al. Fleetrec: Large-scale recommendation inference on hybrid gpu-fpga clusters. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 3097–3105, 2021.

[15] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems, 30, 2017.

[16] Liu Ke, Udit Gupta, Benjamin Youngjae Cho, David Brooks, Vikas Chandra, Utku Diril, Amin Firoozshahian, Kim Hazelwood, Bill Jia, Hsien-Hsin S Lee, et al. Recnmp: Accelerating personalized recommendation with near-memory processing. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pages 790–803. IEEE, 2020.

[17] Daniar H Kurniawan, Ruipu Wang, Kahfi S Zulkifli, Fandi A Wiranata, John Bent, Ymir Vigfusson, and Haryadi S Gunawi. Evstore: Storage and caching capabilities for scaling embedding tables in deep recommendation systems. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vol-... (2023)

[18] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
[19] Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. Tensordimm: A practical near-memory processing architecture for embeddings and tensor operations in deep learning. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 740–753, 2019.

[20] Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. InfiniGen: Efficient generative inference of large language models with dynamic KV cache management. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 155–172, 2024.

[21] Yejin Lee, Seong Hoon Seo, Hyunji Choi, Hyoung Uk Sul, Soosung Kim, Jae W Lee, and Tae Jun Ham. Merci: efficient embedding reduction on commodity hardware via sub-query memoization. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 302–313, 2021.

[22] Evan Liu, Milad Hashemi, Kevin Swersky, Parthasarathy Ranganathan, and Junwhan Ahn. An imitation learning approach for cache replacement. In International Conference on Machine Learning, pages 6237–6247. PMLR, 2020.

[23] Thodoris Lykouris and Sergei Vassilvitskii. Competitive caching with machine learned advice. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3296–3305. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/lykouris18a.html

[24] Michael Mitzenmacher and Sergei Vassilvitskii. Algorithms with predictions. Commun. ACM, 65(7):33–35, June 2022. ISSN 0001-0782. doi: 10.1145/3528087. URL https://doi.org/10.1145/3528087
[25] Michael Mitzenmacher and Sergei Vassilvitskii. Algorithms with predictions. Communications of the ACM, 65(7):33–35, 2022.

[26] Yudi Qiu, Lingfei Lu, Shiyan Yi, Minge Jing, Xiaoyang Zeng, Yang Kong, and Yibo Fan. Flips: A flexible partitioning strategy near memory processing architecture for recommendation system. IEEE Transactions on Parallel and Distributed Systems, 2025.

[27] Pol G. Recasens, Ferran Agullo, Yue Zhu, Chen Wang, Eun Kyung Lee, Olivier Tardieu, Jordi Torres, and Josep Ll. Berral. Mind the memory gap: Unveiling gpu bottlenecks in large-batch llm inference, 2025. URL https://arxiv.org/abs/2503.08311

[28] Jie Ren, Bin Ma, Shuangyan Yang, Benjamin Francis, Ehsan K Ardestani, Min Si, and Dong Li. Machine learning-guided memory optimization for dlrm inference on tiered memory. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 1631–1647. IEEE, 2025.

[29] Liana V Rodriguez, Farzana Yusuf, Steven Lyons, Eysler Paz, Raju Rangaswami, Jason Liu, Ming Zhao, and Giri Narasimhan. Learning cache replacement with CACHEUS. In 19th USENIX Conference on File and Storage Technologies (FAST 21), pages 341–354, 2021.

[30] Dhruv Rohatgi. Near-optimal bounds for online caching with machine learned advice. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1834–1845. SIAM, 2020.
[31] Karim Ahmed Abdel Sadek and Marek Elias. Algorithms for caching and MTS with reduced number of predictions. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=QuIiLSktO4

[32] Zhan Shi, Xiangru Huang, Akanksha Jain, and Calvin Lin. Applying deep learning to the cache replacement problem. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 413–425, 2019.

[33] Xiaoniu Song, Yiwen Zhang, Rong Chen, and Haibo Chen. Ugache: A unified gpu cache for embedding-based deep learning. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 627–641, 2023.

[34] Zhenyu Song, Daniel S Berger, Kai Li, Anees Shaikh, Wyatt Lloyd, Soudeh Ghorbani, Changhoon Kim, Aditya Akella, Arvind Krishnamurthy, Emmett Witchel, et al. Learning relaxed belady for content distribution network caching. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20), pages 529–544, 2020.

[35] Zhenyu Song, Kevin Chen, Nikhil Sarda, Deniz Altınbüken, Eugene Brevdo, Jimmy Coleman, Xiao Ju, Pawel Jurczyk, Richard Schooler, and Ramki Gummadi. HALP: Heuristic aided learned preference eviction policy for YouTube content delivery network. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pages 1149–1163, 2023.

[36] Qwen Team. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.
[37]

Aibrix: Towards scalable, cost-effective large language model inference infrastructure.arXiv preprint arXiv:2504.03648, 2025

The AIBrix Team, Jiaxin Shan, Varun Gupta, Le Xu, Haiyang Shi, Jingyuan Zhang, Ning Wang, Linhui Xu, 14 Rong Kang, Tongping Liu, et al. Aibrix: Towards scal- able, cost-effective large language model inference in- frastructure.arXiv preprint arXiv:2504.03648, 2025

work page arXiv 2025
[38]

Driving cache replacement with {ML-based}{LeCaR}

Giuseppe Vietri, Liana V Rodriguez, Wendy A Martinez, Steven Lyons, Jason Liu, Raju Rangaswami, Ming Zhao, and Giri Narasimhan. Driving cache replacement with {ML-based}{LeCaR}. In10th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 18), 2018

work page 2018
[39]

Kvcache cache in the wild: Char- acterizing and optimizing kvcache cache at a large cloud provider

Jiahao Wang, Jinbo Han, Xingda Wei, Sijie Shen, Dingyan Zhang, Chenguang Fang, Rong Chen, Wenyuan Yu, and Haibo Chen. Kvcache cache in the wild: Char- acterizing and optimizing kvcache cache at a large cloud provider. In2025 USENIX Annual Technical Confer- ence (USENIX ATC 25). USENIX Association, July

work page
[40]

URL https://www.usenix.org/conference/ atc25/presentation/wang-jiahao

work page
[41] Zehuan Wang, Yingcan Wei, Minseok Lee, Matthias Langer, Fan Yu, Jie Liu, Shijie Liu, Daniel G. Abel, Xu Guo, Jianbing Dong, Ji Shi, and Kunlun Li. Merlin hugectr: Gpu-accelerated recommender system training and inference. In Proceedings of the 16th ACM Conference on Recommender Systems, RecSys ’22, page 534–537, New York, NY, USA, 2022. Association for ... doi:10.1145/3523227.3547405
[42] Alexander Wei. Better and simpler learning-augmented online caching. arXiv preprint arXiv:2005.13716, 2020.
[43] Yingcan Wei, Matthias Langer, Fan Yu, Minseok Lee, Jie Liu, Ji Shi, and Zehuan Wang. A gpu-specialized inference parameter server for large-scale deep recommendation models. In Proceedings of the 16th ACM Conference on Recommender Systems, pages 408–419, 2022.
[44] Minhui Xie, Youyou Lu, Jiazhen Lin, Qing Wang, Jian Gao, Kai Ren, and Jiwu Shu. Fleche: an efficient gpu embedding cache for personalized recommendations. In Proceedings of the Seventeenth European Conference on Computer Systems, pages 402–416, 2022.
[45] Gang Yan, Jian Li, and Don Towsley. Learning from optimal caching for content delivery. In Proceedings of the 17th International Conference on emerging Networking EXperiments and Technologies, pages 344–358, 2021.
[46] Dongsheng Yang, Daniel S Berger, Kai Li, and Wyatt Lloyd. A learned cache eviction framework with minimal overhead. arXiv preprint arXiv:2301.11886, 2023.
[47] Juncheng Yang, Ziming Mao, Yao Yue, and KV Rashmi. GL-Cache: Group-level learning for efficient and high-performance caching. In 21st USENIX Conference on File and Storage Technologies (FAST 23), pages 115–134, 2023.
[48] Lingyun Yang, Yongchen Wang, Yinghao Yu, Qizhen Weng, Jianbo Dong, Kan Liu, Chi Zhang, Yanyi Zi, Hao Li, Zechao Zhang, et al. GPU-Disaggregated serving for deep learning recommendation models at scale. In 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25), pages 847–863, 2025.
[49] Guanghu Yuan, Fajie Yuan, Yudong Li, Beibei Kong, Shujie Li, Lei Chen, Min Yang, Chenyun Yu, Bo Hu, Zang Li, et al. Tenrec: A large-scale multipurpose benchmark dataset for recommender systems. Advances in Neural Information Processing Systems, 35:11480–11493, 2022.
[50] Yazhuo Zhang, Juncheng Yang, Yao Yue, Ymir Vigfusson, and KV Rashmi. SIEVE is simpler than LRU: an efficient Turn-Key eviction algorithm for web caches. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 1229–1246, 2024.
[51] Zhe Zhang, Ziyue Luo, and Chuan Wu. Two-level graph caching for expediting distributed gnn training. In IEEE INFOCOM 2023-IEEE Conference on Computer Communications, pages 1–10. IEEE, 2023.
[52] Weijie Zhao, Shulong Tan, and Ping Li. Song: Approximate nearest neighbor search on gpu. In 2020 IEEE 36th International Conference on Data Engineering (ICDE), pages 1033–1044. IEEE, 2020.
[53] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems, 37:62557–62583, 2024.
[54] Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. Deep interest evolution network for click-through rate prediction. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 5941–5948, 2019.
[55] Wenbin Zhou, Zhixiong Niu, Yongqiang Xiong, Juan Fang, and Qian Wang. 3L-Cache: Low overhead and precise learning-based eviction policy for caches. In 23rd USENIX Conference on File and Storage Technologies (FAST 25), pages 237–254, 2025.

Artifact Appendix

Abstract

The provided artifacts include all code and non-confidential traces to reproduce the...
[56] SLS-Cache-Bench. For DLRMs, we develop a dedicated benchmark that isolates the SparseLengthsSum (SLS) operation from the HugeCTR framework and integrates LCR. The benchmark supports direct comparison of GPU embedding cache policies, including FPB (following predictions blindly), LARU, and classical LRU. We implement an alternative GPU cache in the fl...
[57] LCR-on-SGLang. For LLMs, we integrate LCR into SGLang (version 0.4.9.post2) by adding LightGBM support, implementing an online training framework, and replacing the default RadixTree cache (radix_cache.py) with LARU. This integration enables systematic comparison of GPU KV cache policies, including FPB, LARU, and classical LRU. It comprises over 5K l...
[58] DLRMs: CUDA 12.4, PyTorch 2.6.0, and either devtoolset-9 or gcc-9.
[59] LLMs: CUDA 12.6, PyTorch 2.6.0, and the required third-party Python packages, lightgbm. The hardware used to generate the results in this paper is described in Section 4.1 for DLRMs and Section 5.1 for LLMs. Nevertheless, the experiments can also be conducted on other GPUs, provided that the available HBM capacity is sufficient to accommodate both the ML ...
[60] Bounded robustness. The robustness of LARU is O(k), as proved in Technical Appendix D. In other words, regardless of predictor accuracy, its competitive ratio is bounded by O(k), matching LRU’s k-competitive ratio asymptotically and safeguarding against catastrophic performance degradation or systemic risk.
[61] Ideal consistency. The consistency of LARU is 1, as proved in Technical Appendix E. Under perfect predictions, LARU matches the offline optimal algorithm, fully leveraging accurate machine-learned predictions.
[62] Low time complexity. The algorithm runs in $O(\log k)$ amortized time per request, which is practical since k is small in practice. In DLRMs, embedding caches are partitioned into buckets (e.g., SlabHash in HugeCTR), each holding only tens of items (e.g., 64). In LLMs, KV cache vectors are typically organized via RadixTrees in SGLang or paged-attention in vLLM,...
[63] Reduced predictor usage. In asynchronous mode, prediction tasks are triggered only under specific conditions, such as at most once per fixed interval or after the previous task has completed, preventing excessive usage. In synchronous mode, once a prediction-induced cache miss is detected (Line 14 in Algorithm 1), LARU falls back to LRU once, limiting ...
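The asynchronous gating described under reduced predictor usage can be sketched as a small throttle. This is a minimal illustration, not the paper's implementation: the class name `PredictionThrottle`, its methods, and the `min_interval_s` parameter are hypothetical, and the sketch enforces both example conditions at once (a minimum interval and no overlapping task), whereas the text lists them as alternatives.

```python
import time


class PredictionThrottle:
    """Hypothetical gate that decides whether a new asynchronous prediction task may start."""

    def __init__(self, min_interval_s, clock=time.monotonic):
        self.min_interval_s = min_interval_s  # assumed fixed interval between task starts
        self.clock = clock                    # injectable clock, only to make the gate testable
        self.last_start = None                # start time of the most recent task
        self.in_flight = False                # True while a prediction task is still running

    def try_start(self):
        """Return True and mark a task as running if a new task is allowed now."""
        now = self.clock()
        if self.in_flight:
            return False  # previous task has not completed yet
        if self.last_start is not None and now - self.last_start < self.min_interval_s:
            return False  # at most one task start per interval
        self.in_flight = True
        self.last_start = now
        return True

    def finish(self):
        """Mark the running prediction task as completed."""
        self.in_flight = False
```

A caller would invoke `try_start()` before submitting a prediction job and `finish()` from the job's completion path; requests that arrive while the gate is closed simply proceed without triggering a new prediction.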
[64] Case (1): the requested item is a new item that has not yet been evicted in the current phase.
[65] Case (2): the requested item was previously evicted by LRU’s policy within the same phase. The number of occurrences of case (1) equals the number of distinct new items, $c_i$. Note that after $\log_b(k)$ occurrences of case (2), $\lambda$ has become $1/b^{\log_b(k)} = 1/k$, the candidate set $L$ shrinks to size 1, and the eviction step at Line 25 degenerates to LRU’s policy. Hence, the n...
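The shrinking of the candidate set under case (2) can be traced numerically. The sketch below assumes that λ starts at 1 and is divided by b on every case (2) occurrence, and that the candidate set holds about λ·k items; the function names, the rounding rule, and the values k = 64 and b = 2 are illustrative assumptions, not taken from Algorithm 1.

```python
def lambda_after_case2(b, occurrences):
    """Return lambda after the given number of case (2) events.

    Assumption: lambda starts at 1 and is divided by b on each event.
    """
    lam = 1.0
    for _ in range(occurrences):
        lam /= b
    return lam


def candidate_set_size(k, lam):
    """Assumed sizing rule: about lam * k candidates, never fewer than one."""
    return max(1, round(lam * k))


k, b = 64, 2   # k = 64 mirrors the bucket-size example; b = 2 is an arbitrary choice
steps = 6      # log_2(64) occurrences of case (2)

sizes = [candidate_set_size(k, lambda_after_case2(b, j)) for j in range(steps + 1)]
print(sizes)   # [64, 32, 16, 8, 4, 2, 1]
```

Under these assumptions the candidate set reaches a single item after exactly log_b(k) occurrences, at which point choosing among the candidates leaves no freedom and the eviction coincides with LRU's choice.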
