MTServe: Efficient Serving for Generative Recommendation Models with Hierarchical Caches
Pith reviewed 2026-05-08 12:24 UTC · model grok-4.3
The pith
MTServe virtualizes GPU memory using host RAM and targeted optimizations to speed up generative recommendation serving by up to 3.1 times.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MTServe is a hierarchical cache management system that virtualizes GPU memory by leveraging host RAM as a scalable backup store for the massive key-value caches generated by generative recommendation models. It bridges the I/O gap between tiers through a hybrid storage layout, an asynchronous data transfer pipeline, and a locality-driven replacement policy, delivering up to 3.1 times speedup while preserving hit ratios above 98.5 percent on both public and production datasets.
What carries the argument
The hierarchical cache management system that treats host RAM as an extension of GPU memory, using a hybrid storage layout, an asynchronous transfer pipeline, and locality-driven replacement to minimize I/O overhead.
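The mechanism can be pictured with a toy two-tier cache: a small GPU-resident tier backed by a larger host-RAM tier, with least-recently-used demotion standing in for the locality-driven replacement policy. A minimal sketch, not MTServe's implementation; the class and capacities are illustrative:

```python
from collections import OrderedDict

class TwoTierKVCache:
    """Toy two-tier cache: a small 'GPU' tier backed by a larger 'host' tier.

    On overflow, the least-recently-used entry is demoted to the host tier
    (a stand-in for a locality-driven replacement policy); a host-tier hit
    promotes the entry back, standing in for a host-to-GPU transfer.
    """

    def __init__(self, gpu_capacity: int, host_capacity: int):
        self.gpu = OrderedDict()   # user_id -> KV state (most recent last)
        self.host = OrderedDict()
        self.gpu_capacity = gpu_capacity
        self.host_capacity = host_capacity
        self.hits = self.misses = 0

    def get(self, user_id):
        if user_id in self.gpu:                # GPU hit: no transfer needed
            self.gpu.move_to_end(user_id)
            self.hits += 1
            return self.gpu[user_id]
        if user_id in self.host:               # host hit: promote across PCIe
            state = self.host.pop(user_id)
            self.hits += 1
            self._admit(user_id, state)
            return state
        self.misses += 1                       # miss: history must be re-encoded
        return None

    def put(self, user_id, state):
        self._admit(user_id, state)

    def _admit(self, user_id, state):
        self.gpu[user_id] = state
        self.gpu.move_to_end(user_id)
        if len(self.gpu) > self.gpu_capacity:  # demote the LRU entry to host RAM
            victim, v_state = self.gpu.popitem(last=False)
            self.host[victim] = v_state
            if len(self.host) > self.host_capacity:
                self.host.popitem(last=False)  # evict entirely
```

In this toy model, the reported >98.5 percent hit ratio corresponds to `hits / (hits + misses)` staying near 1 across a request stream.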
If this is right
- Generative recommendation inference becomes practical at scale even when per-user state exceeds single-GPU limits.
- High cache hit ratios above 98.5 percent reduce repeated history encoding across requests.
- The combination of hybrid layout and async transfers sustains performance when data must move between memory tiers.
- Production systems can handle longer user histories without proportional growth in serving latency.
- Similar virtualization tactics can support other models that generate large reusable state during inference.
Where Pith is reading between the lines
- The same hierarchical approach could extend to long-context language models where KV cache sizes also exceed GPU capacity.
- Locality-driven replacement may prove useful in other recommendation or retrieval systems that exhibit temporal access patterns.
- If traffic patterns differ from the tested workloads, the async pipeline might require additional tuning to maintain gains.
- Storage virtualization at the serving layer offers a general path for memory-bound machine learning inference tasks.
Load-bearing premise
The system-level optimizations can bridge the I/O gap between GPU and host RAM without introducing overheads that erase the speedup in real production traffic.
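Whether that premise holds is largely a bandwidth question. A back-of-envelope check (all numbers illustrative, not taken from the paper) asks whether a per-request KV transfer fits inside the compute window an async pipeline could hide it behind:

```python
def transfer_can_be_hidden(kv_bytes_per_request: float,
                           pcie_gbps: float,
                           compute_ms_per_request: float) -> bool:
    """Rough check: can an async pipeline overlap host->GPU KV transfers
    with model compute? True when the transfer fits inside the compute
    window of the preceding request. Illustrative numbers only."""
    transfer_ms = kv_bytes_per_request / (pcie_gbps * 1e9) * 1e3
    return transfer_ms <= compute_ms_per_request

# e.g. a hypothetical 64 MB per-user KV state over a 25 GB/s effective
# PCIe link takes ~2.6 ms, which fits under a 10 ms decode step.
```

When per-request state grows faster than effective link bandwidth, the inequality flips and transfer overheads begin to erase the speedup, which is exactly the scenario the premise rules out.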
What would settle it
Deploying MTServe under bursty production traffic and measuring whether the end-to-end speedup falls below 1× (i.e., transfer overheads erase the gain) would directly test the claim.
original abstract
Generative recommendation (GR) offers superior modeling capabilities but suffers from prohibitive inference costs due to the repeated encoding of long user histories. While cross-request Key-Value (KV) cache reuse presents a significant optimization opportunity, the massive scale of individual user states creates a storage explosion that far exceeds physical GPU limits. We propose MTServe, a hierarchical cache management system that virtualizes GPU memory by leveraging host RAM as a scalable backup store. To bridge the I/O gap between tiers, MTServe introduces a suite of system-level optimizations, including a hybrid storage layout, an asynchronous data transfer pipeline, and a locality-driven replacement policy. On both public and production datasets, MTServe delivers up to 3.1× speedup while maintaining near-perfect hit ratios (>98.5%).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MTServe, a hierarchical cache management system for serving generative recommendation models. It virtualizes GPU memory by using host RAM as a backup store for the large user-state KV caches that arise from long histories, and introduces three system optimizations (hybrid storage layout, asynchronous data transfer pipeline, and locality-driven replacement policy) to hide cross-tier I/O latency. Empirical evaluation on public and production datasets is reported to yield up to 3.1× speedup while preserving hit ratios above 98.5%.
Significance. If the performance claims are shown to be robust, the work would be significant for practical deployment of generative recommendation systems, which currently face prohibitive inference costs from repeated history encoding. The approach of treating host RAM as a first-class extension of GPU memory, together with the concrete optimizations for overlap and locality, could inform future serving stacks for large-scale sequence models.
major comments (2)
- [Abstract] The central performance claims (3.1× speedup, >98.5% hit ratio) are stated without any description of experimental setup, baselines, hardware, concurrency levels, or statistical variance. Because the paper's contribution is empirical, this omission makes it impossible to assess whether the reported gains are load-bearing or reproducible.
- [System Design and Evaluation] The hybrid layout, asynchronous pipeline, and locality-driven replacement policy are presented as the mechanisms that keep transfers overlapped with computation, yet no per-component latency breakdowns, transfer-vs-compute overlap measurements, or results under bursty or high-concurrency workloads are supplied. Without these data, the claim that I/O latency is fully hidden cannot be verified and remains a load-bearing assumption behind the speedup result.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving the clarity and verifiability of our empirical claims and system evaluation. We address each major comment below and commit to revisions that will strengthen the paper without altering its core contributions.
point-by-point responses
Referee: [Abstract] The central performance claims (3.1× speedup, >98.5% hit ratio) are stated without any description of experimental setup, baselines, hardware, concurrency levels, or statistical variance. Because the paper's contribution is empirical, this omission makes it impossible to assess whether the reported gains are load-bearing or reproducible.
Authors: We agree that the abstract would benefit from additional context on the experimental conditions to enhance interpretability and reproducibility. In the revised version, we will expand the abstract with a concise description of the evaluation setup, including the public and production datasets, hardware configuration (GPUs augmented with host RAM), comparison baselines, concurrency levels tested, and that speedups are reported as averages with low variance across runs. This will address the concern while remaining within typical abstract length limits. revision: yes
Referee: [System Design and Evaluation] The hybrid layout, asynchronous pipeline, and locality-driven replacement policy are presented as the mechanisms that keep transfers overlapped with computation, yet no per-component latency breakdowns, transfer-vs-compute overlap measurements, or results under bursty or high-concurrency workloads are supplied. Without these data, the claim that I/O latency is fully hidden cannot be verified and remains a load-bearing assumption behind the speedup result.
Authors: We acknowledge the value of more granular evidence for the system optimizations. The current manuscript focuses on end-to-end results, but we will add a dedicated micro-benchmark subsection in the evaluation. This will include per-component latency breakdowns, direct measurements of transfer-compute overlap, and performance under high-concurrency and bursty workloads. These additions will provide the necessary data to verify that I/O latency is effectively hidden by the hybrid layout, asynchronous pipeline, and locality-driven policy. revision: yes
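The kind of overlap measurement the rebuttal promises can be sketched with synthetic stand-ins for transfer and compute, comparing serial execution against a one-step-ahead prefetch pipeline. An illustrative timing harness, not MTServe's benchmark; all durations are hypothetical:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_transfer(ms: float):  # stand-in for a host->GPU KV copy
    time.sleep(ms / 1e3)

def fake_compute(ms: float):   # stand-in for a model decode step
    time.sleep(ms / 1e3)

def serve(n_requests: int, transfer_ms: float, compute_ms: float,
          pipelined: bool) -> float:
    """Time n requests in ms. Pipelined mode prefetches request i+1's
    KV state while request i computes (a stand-in for an async
    transfer pipeline)."""
    start = time.perf_counter()
    if pipelined:
        with ThreadPoolExecutor(max_workers=1) as pool:
            pending = pool.submit(fake_transfer, transfer_ms)
            for _ in range(n_requests):
                pending.result()                  # KV state now resident
                pending = pool.submit(fake_transfer, transfer_ms)
                fake_compute(compute_ms)          # next transfer overlaps
    else:
        for _ in range(n_requests):
            fake_transfer(transfer_ms)            # serial: transfer then compute
            fake_compute(compute_ms)
    return (time.perf_counter() - start) * 1e3
```

With transfer shorter than compute, the pipelined time approaches one transfer plus n compute steps, which is the "I/O latency is hidden" regime; a real micro-benchmark would substitute instrumented copies and kernels for the sleeps.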
Circularity Check
No circularity; empirical system evaluation with no derivations or fitted predictions
full rationale
The paper describes a hierarchical caching system (MTServe) for generative recommendation inference, proposing concrete optimizations (hybrid layout, async pipeline, locality-driven policy) and reporting measured speedups and hit ratios on datasets. No equations, first-principles derivations, parameter fits, or predictions appear; all load-bearing claims are direct empirical outcomes from implementation and benchmarking. These results are externally falsifiable via reproduction and do not reduce to self-definition or self-citation chains.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Host RAM can serve as a reliable backup tier for GPU memory, with transfer latency low enough to be hidden, for KV cache data in recommendation workloads.
- domain assumption User history access patterns exhibit sufficient locality to make replacement policies effective.
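The second assumption is testable in miniature: under skewed (Zipf-like) access, an LRU cache holding a small fraction of users already achieves a high hit ratio. A synthetic simulation; the workload and parameters are illustrative, not the paper's:

```python
import random
from collections import OrderedDict

def lru_hit_ratio(num_users: int, num_requests: int,
                  cache_size: int, zipf_s: float, seed: int = 0) -> float:
    """Simulate an LRU cache under Zipf-distributed user accesses.
    zipf_s = 0 gives uniform traffic; larger values skew toward hot users."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** zipf_s) for rank in range(1, num_users + 1)]
    cache, hits = OrderedDict(), 0
    for _ in range(num_requests):
        user = rng.choices(range(num_users), weights=weights, k=1)[0]
        if user in cache:
            hits += 1
            cache.move_to_end(user)    # refresh recency on a hit
        else:
            cache[user] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the LRU user
    return hits / num_requests
```

With a cache holding 10 percent of users, skewed traffic yields a far higher hit ratio than uniform traffic, which is the locality the replacement policy depends on.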