pith. machine review for the scientific record.

arxiv: 2605.04450 · v1 · submitted 2026-05-06 · 💻 cs.DC · cs.IR · cs.LG

Recognition: 3 theorem links

One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving

Amelie Chi Zhou, Shuguang Han, Wenjun Yu

Pith reviewed 2026-05-08 17:42 UTC · model grok-4.3

classification 💻 cs.DC · cs.IR · cs.LG
keywords generative recommender · HBM partitioning · KV cache · embedding cache · adaptive allocation · PPO controller · request scheduling · GPU inference serving

The pith

HELM dynamically partitions GPU HBM between embedding and KV caches, cutting P99 latency by 24-38% while sustaining 93.5-99.6% SLO satisfaction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Generative recommender inference requires both embedding hot caches and KV caches to reside in limited GPU high-bandwidth memory, yet the allocation ratio that minimizes latency shifts by as much as 0.35 across steady, trend, and burst regimes. Static partitions therefore leave 20-30% of the achievable latency improvement on the table, while naive reallocation moves data over the PCIe bus on the critical path and triggers SLO violations. HELM addresses this by running a low-latency PPO controller that continuously adjusts the split, and by routing requests according to current KV residency, embedding locality, and node load. On three production-scale datasets across a 32-node A100 cluster, the system delivers the reported gains without throughput loss.
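To see why a single static split cannot win everywhere, consider a toy cost model of the trade-off. Everything below is invented for illustration (the constants, hit-rate curves, and miss costs are not from the paper); it only shows how the latency-minimizing EMB share α can move as the workload mix drifts:

```python
# Toy illustration only: invented constants, not the paper's cost model.
# alpha is the EMB share of HBM; the remainder (1 - alpha) holds KV cache.
def toy_latency(alpha: float, hot_user_share: float) -> float:
    emb_hit = min(1.0, alpha / 0.6)             # EMB hit rate saturates with capacity
    kv_hit = min(1.0, (1 - alpha) / 0.7)        # KV hit rate saturates likewise
    emb_miss_cost = 30.0 * hot_user_share       # ms penalty for host-side embedding fetch
    kv_miss_cost = 45.0 * (1 - hot_user_share)  # ms penalty for KV refill/recompute
    return (1 - emb_hit) * emb_miss_cost + (1 - kv_hit) * kv_miss_cost

def best_alpha(hot_user_share: float) -> float:
    grid = [i / 100 for i in range(101)]
    return min(grid, key=lambda a: toy_latency(a, hot_user_share))

# As the hot-user share drifts from 0.8 to 0.3, the toy optimum moves from
# 0.60 to 0.30, echoing the reported ratio shifts of up to 0.35 across regimes.
print(best_alpha(0.8), best_alpha(0.3))
```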

Core claim

HELM jointly manages HBM allocation and request routing at runtime through two components: Adaptive Memory Allocation, a three-layer PPO controller (frozen base policy, online residual adapter, burst-aware recovery) that makes decisions in 32 μs and stays within 0.024-0.029 of the offline-optimal EMB-KV ratio, and EMB-KV-Aware Scheduling, which routes requests by jointly considering KV residency, embedding locality, and node load. This combination reduces P99 latency by 24-38% versus the best static policy and achieves 93.5-99.6% SLO satisfaction across Steady, Trend, and Burst workloads.

What carries the argument

The three-layer PPO controller (frozen base policy, online residual adapter, and burst-aware recovery controller) that tracks the optimal EMB-KV memory ratio at 32 μs latency without introducing H2D refill traffic.
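The review names the three layers but carries no pseudocode. Here is a minimal sketch of how such a stack could compose; the blend rule, clamp bounds, burst trigger, and all names are our assumptions, not the authors' implementation:

```python
import numpy as np

class ThreeLayerAllocator:
    """Hedged sketch of the described stack: frozen PPO base policy, online
    residual adapter, burst-aware recovery. The composition below is an
    assumption about how the layers could interact, not the paper's code."""

    def __init__(self, base_policy, residual_adapter, burst_threshold=3.0):
        self.base = base_policy            # frozen PPO policy: state -> alpha
        self.residual = residual_adapter   # small correction head, updated online
        self.burst_threshold = burst_threshold
        self.prev_alpha = 0.5              # last applied EMB share

    def decide(self, state: np.ndarray, request_rate_z: float) -> float:
        alpha = self.base(state)           # coarse split from the frozen policy
        alpha += self.residual(state)      # online correction for workload drift
        if request_rate_z > self.burst_threshold:
            # Burst-aware recovery: damp toward the last applied split so a
            # traffic spike cannot trigger a large, refill-heavy reallocation.
            alpha = 0.5 * alpha + 0.5 * self.prev_alpha
        alpha = float(np.clip(alpha, 0.05, 0.95))  # keep both caches nonempty
        self.prev_alpha = alpha
        return alpha                       # EMB share; KV cache gets 1 - alpha
```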

If this is right

  • Joint allocation and routing decisions eliminate routing inefficiencies that appear when caches are managed separately under heterogeneous node states (a routing-score sketch follows this list).
  • The controller maintains throughput parity while cutting tail latency because it avoids PCIe data movement on the serving path.
  • Production-scale evaluation on three datasets over 32 A100 nodes confirms the gains hold across steady, trending, and bursty request patterns.
  • The same framework applies to any serving system in which two or more cache types compete for a single fast memory pool.
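The routing claim above lends itself to a concrete rule. A minimal sketch of one way to fold KV residency, embedding locality, and node load into a single per-node routing score (the linear weighting, weight values, and field names are our assumptions; the paper's cost model may differ):

```python
from dataclasses import dataclass

@dataclass
class NodeState:
    kv_resident_frac: float  # fraction of this request's KV blocks already on the node
    emb_hit_est: float       # estimated embedding-cache hit rate for this request
    load: float              # normalized queue depth / utilization in [0, 1]

def routing_score(n: NodeState, w_kv=0.4, w_emb=0.4, w_load=0.2) -> float:
    # Higher is better: prefer nodes already holding the KV state and the hot
    # embeddings, and penalize loaded nodes. Weights are illustrative only.
    return w_kv * n.kv_resident_frac + w_emb * n.emb_hit_est - w_load * n.load

def route(nodes: list[NodeState]) -> int:
    # Send the request to the node with the best combined score.
    return max(range(len(nodes)), key=lambda i: routing_score(nodes[i]))
```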

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same low-overhead controller pattern could be reused for other competing GPU memory structures such as activation caches and parameter shards in large-model inference.
  • Because the optimal ratio changes with workload statistics, similar adaptive partitioning may become necessary as model sizes and context lengths continue to grow.
  • Extending the burst-aware recovery layer to incorporate short-term traffic predictions could further shrink the residual tracking error below 0.02.

Load-bearing premise

The three-layer PPO controller can track the shifting optimal EMB-KV ratio within 0.024-0.029 while keeping 32 μs decision latency and avoiding H2D refill traffic on the critical path under real production workload shifts.

What would settle it

Re-run the same Steady, Trend, and Burst workloads with the controller frozen at a static ratio: if the frozen system's P99 improvement drops below 10% or its SLO satisfaction falls below 90%, the adaptive controller is demonstrably carrying the reported gains.
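For concreteness, a minimal sketch of the metrics such a freeze-the-controller ablation would report, assuming raw per-request latency samples and a single SLO threshold (both hypothetical inputs):

```python
def p99(latencies_ms: list[float]) -> float:
    # Nearest-rank 99th-percentile latency over one run's samples.
    s = sorted(latencies_ms)
    return s[max(0, int(0.99 * len(s)) - 1)]

def slo_satisfaction(latencies_ms: list[float], slo_ms: float) -> float:
    # Fraction of requests that finish within the SLO.
    return sum(l <= slo_ms for l in latencies_ms) / len(latencies_ms)

def p99_improvement(static_run: list[float], adaptive_run: list[float]) -> float:
    # Relative P99 reduction of adaptive serving over the frozen-ratio run.
    return 1.0 - p99(adaptive_run) / p99(static_run)
```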

Figures

Figures reproduced from arXiv: 2605.04450 by Amelie Chi Zhou, Shuguang Han, Wenjun Yu.

Figure 1: HSTU inference latency under different settings
Figure 2: HSTU Model Structure
Figure 3: HSTU Online Inference Serving Architecture
Figure 4: Normalized latency vs. EMB–KV allocation ratio
Figure 5: Rapid production fluctuations in hot-user request share and mean sequence length
Figure 7: Overall architecture of HELM
Figure 9: HELM overlaps EMB H2D transfers and throttles background reallocation traffic
Figure 11: Overall P99 latency of three methods across three workloads
Figure 14: Allocator-internal ablation study
Figure 13: System-level ablation of HELM
Figure 15: RL-controlled αRL versus offline-optimal α* over a four-hour window (Amazon-Books). Scatter points denote hot-user request regimes: Steady, Trend, and Burst
Figure 16: Average αRL under varying cluster scale and hot-request share. Color denotes regime: All-EMB, Split, All-KV
read the original abstract

Generative Recommender (GR) inference places embedding hot caches (EMB) and KV caches in direct competition for limited GPU HBM: allocating more memory to one improves its efficiency but degrades the other. Existing systems optimize them in isolation, overlooking that the optimal EMB-KV allocation ratio can shift by up to 0.35 across workload regimes, leaving 20-30% latency improvement unrealized. While online reallocation is required to close this gap, naive approaches introduce H2D refill traffic on the critical path, causing P99 SLO violations. To address this, we present HELM, which jointly manages HBM allocation and request routing at runtime through two key components: (1) Adaptive Memory Allocation, a three-layer PPO-based controller (frozen base policy, online residual adapter, and burst-aware recovery controller) that achieves 32 μs decision latency while staying within 0.024-0.029 of the offline-optimal ratio; and (2) EMB-KV-Aware Scheduling, which routes requests by jointly considering KV residency, embedding locality, and node load to avoid routing inefficiencies under heterogeneous allocations. Evaluations on three production-scale datasets over a 32-node A100 cluster show that HELM reduces P99 latency by 24-38% over the best static policy and achieves 93.5-99.6% SLO satisfaction across Steady, Trend, and Burst workloads, significantly outperforming state-of-the-art baselines without sacrificing throughput.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents HELM, a system for generative recommender serving that jointly manages HBM allocation between embedding (EMB) caches and KV caches via a three-layer PPO controller (frozen base, online residual adapter, burst-aware recovery) achieving 32 μs decisions and 0.024-0.029 deviation from offline-optimal ratios, combined with EMB-KV-aware request scheduling. On three production datasets over a 32-node A100 cluster, it reports 24-38% P99 latency reduction versus the best static policy and 93.5-99.6% SLO satisfaction across Steady/Trend/Burst workloads, outperforming baselines without throughput loss.

Significance. If the adaptive control and scheduling deliver the claimed overhead-free tracking, the work addresses a genuine and previously under-optimized tension in HBM resource allocation for large-scale GR inference, where optimal EMB-KV ratios can shift by up to 0.35. The production-scale empirical evaluation on real hardware provides concrete, falsifiable latency and SLO numbers that could influence practical serving systems.

major comments (3)
  1. [Abstract and evaluation] Abstract and evaluation sections: the headline 24-38% P99 reduction and 93.5-99.6% SLO claims rest on the three-layer PPO controller tracking the optimal EMB-KV ratio within 0.024-0.029 at 32 μs latency without introducing H2D refill traffic. No histograms of per-decision latency, time-series of controller error, or ablation isolating the residual adapter and burst-recovery components are provided, making it impossible to confirm that the observed gains are attributable to adaptive partitioning rather than other system factors.
  2. [Evaluation] Evaluation on 32-node A100 cluster: the reported results lack error bars, detailed characterization of how the Steady/Trend/Burst traces induce ratio shifts, and implementation specifics for the static-policy and SOTA baselines. This weakens verification of the cross-workload robustness claims.
  3. [EMB-KV-Aware Scheduling] EMB-KV-Aware Scheduling description: it is stated that the policy routes requests considering KV residency, embedding locality, and node load to avoid inefficiencies under heterogeneous allocations, but no quantitative breakdown of refill volume or critical-path H2D traffic under the measured controller deviations is given, leaving the central 'no refill on critical path' assumption unverified.
minor comments (2)
  1. [Abstract] The abstract and text would benefit from explicit workload characterization (e.g., request-rate distributions or embedding-access patterns) to contextualize the 0.35 ratio-shift claim.
  2. [Adaptive Memory Allocation] Notation for the three-layer controller components is introduced without a diagram or pseudocode; a small figure would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight opportunities to strengthen the empirical support for our claims regarding the three-layer PPO controller and scheduling policy. We will incorporate additional visualizations, error bars, workload characterizations, and quantitative breakdowns in the revised manuscript. Below we respond point-by-point to each major comment.

read point-by-point responses
  1. Referee: [Abstract and evaluation] Abstract and evaluation sections: the headline 24-38% P99 reduction and 93.5-99.6% SLO claims rest on the three-layer PPO controller tracking the optimal EMB-KV ratio within 0.024-0.029 at 32 μs latency without introducing H2D refill traffic. No histograms of per-decision latency, time-series of controller error, or ablation isolating the residual adapter and burst-recovery components are provided, making it impossible to confirm that the observed gains are attributable to adaptive partitioning rather than other system factors.

    Authors: We agree that the current presentation leaves the attribution of gains to the adaptive controller less direct than ideal. In the revision we will add: (1) histograms of per-decision controller latency across all workloads, (2) time-series plots of instantaneous controller error relative to the offline-optimal ratio, and (3) an ablation study that isolates the residual adapter and burst-aware recovery layers. These additions will show that the 0.024-0.029 tracking accuracy is achieved at 32 μs and directly correlates with the reported P99 reductions, while the static baselines exhibit larger deviations and higher latency. revision: yes

  2. Referee: [Evaluation] Evaluation on 32-node A100 cluster: the reported results lack error bars, detailed characterization of how the Steady/Trend/Burst traces induce ratio shifts, and implementation specifics for the static-policy and SOTA baselines. This weakens verification of the cross-workload robustness claims.

    Authors: We will add error bars (standard deviation over five independent runs) to all latency and SLO figures. We will also expand the evaluation section with a characterization subsection that plots the time-varying optimal EMB-KV ratio for each trace and quantifies the magnitude and frequency of shifts (up to 0.35). Finally, we will include a table or appendix entry detailing the exact configurations and hyper-parameters used for the static policies and SOTA baselines, enabling exact reproduction. revision: yes

  3. Referee: [EMB-KV-Aware Scheduling] EMB-KV-Aware Scheduling description: it is stated that the policy routes requests considering KV residency, embedding locality, and node load to avoid inefficiencies under heterogeneous allocations, but no quantitative breakdown of refill volume or critical-path H2D traffic under the measured controller deviations is given, leaving the central 'no refill on critical path' assumption unverified.

    Authors: We will add a new figure and accompanying text that reports measured H2D refill volumes and the fraction of traffic on the critical path, broken down by workload and under the observed controller deviations of 0.024-0.029. The data will demonstrate that the EMB-KV-aware scheduler keeps critical-path refills below 1% of total H2D traffic, confirming that the small tracking error does not translate into SLO violations. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical system evaluation with direct measurements

full rationale

The paper presents HELM as an empirical systems contribution: a three-layer PPO controller for runtime HBM partitioning between EMB and KV caches, combined with EMB-KV-aware scheduling. Central claims (24-38% P99 latency reduction, 93.5-99.6% SLO satisfaction) rest on cluster measurements across Steady/Trend/Burst workloads on three production datasets. No derivation chain, equations, or first-principles predictions exist that reduce to fitted inputs or self-citations by construction. The controller's tracking bounds (0.024-0.029) and 32 μs latency are stated as design targets and validated via end-to-end runs rather than derived tautologically. Any self-citations (if present in full text) are not load-bearing for the reported results, which are externally falsifiable via the described 32-node A100 experiments.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim depends on the validity of observed workload shifts and the effectiveness of the learned controller; the PPO components introduce fitted parameters while the shift magnitude is treated as an observed domain fact.

free parameters (1)
  • PPO residual adapter and recovery controller parameters
    The online adapter and burst-aware recovery are trained or tuned components whose exact values are not stated but are required for the 32 μs decision latency and 0.024-0.029 closeness claims.
axioms (1)
  • domain assumption · Optimal EMB-KV allocation ratio shifts by up to 0.35 across workload regimes
    Presented as a key motivating observation that existing isolated optimizations leave 20-30% latency improvement unrealized.
invented entities (2)
  • Three-layer PPO controller (frozen base + online residual adapter + burst-aware recovery) · no independent evidence
    purpose: Achieve low-latency adaptive HBM allocation
    New controller architecture introduced to manage the allocation without critical-path overhead.
  • EMB-KV-Aware Scheduling policy · no independent evidence
    purpose: Route requests considering KV residency, embedding locality, and node load
    New scheduling component to avoid inefficiencies under heterogeneous allocations.

pith-pipeline@v0.9.0 · 5581 in / 1457 out tokens · 88922 ms · 2026-05-08T17:42:00.026841+00:00 · methodology


Reference graph

Works this paper leans on

58 extracted references · 44 canonical work pages · 6 internal anchors

  1. [1] "Avazu click-through rate prediction," https://www.kaggle.com/c/avazu-ctr-prediction
  2. [2] "KDD Cup 2012 Track 2: Predicting clicks on advertisements," https://www.kdd.org/kdd-cup/view/kdd-cup-2012-track-2
  3. [3] "Meta synthetic datasets for embedding lookup workloads," https://github.com/facebookresearch/dlrm_datasets
  4. [4] "Taobao user behavior dataset," https://tianchi.aliyun.com/dataset/dataDetail?dataId=649
  5. [5] K. J. Åström and T. Hägglund, PID Controllers: Theory, Design and Tuning, 2nd ed. Research Triangle Park, NC: Instrument Society of America, 1995. https://portal.research.lu.se/en/publications/pid-controllers-theory-design-and-tuning
  6. [6] Z. Chai, Q. Ren, X. Xiao, H. Yang, B. Han, S. Zhang, D. Chen, H. Lu, W. Zhao, L. Yu et al., "Longer: Scaling up long sequence modeling in industrial recommenders," in Proceedings of the Nineteenth ACM Conference on Recommender Systems, 2025, pp. 247–256. https://doi.org/10.1145/3705328.3748065
  7. [7] P. Chen, J. Zhang, H. Zhao, Y. Zhang, J. Yu, X. Tang, Y. Wang, H. Li, J. Zou, G. Xiong et al., "Toward robust and efficient ML-based GPU caching for modern inference," arXiv preprint arXiv:2509.20979, 2025. https://doi.org/10.48550/arXiv.2509.20979
  8. [8] R. Chen, J. Wu, H. Shi, Y. Li, X. Liu, and G. Wang, "DRLPart: A deep reinforcement learning framework for optimally efficient and robust resource partitioning on commodity servers," in Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing, 2021, pp. 175–188. https://doi.org/10.1145/3431...
  9. [9] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794. https://doi.org/10.1145/2939672.2939785
  10. [10] J. Coburn, C. Tang, S. A. Asal, N. Agrawal, R. Chinta, H. Dixit, B. Dodds, S. Dwarakapuram, A. Firoozshahian, C. Gao et al., "Meta's second generation AI chip: Model-chip co-design and productionization experiences," in Proceedings of the 52nd Annual International Symposium on Computer Architecture, 2025, pp. 1689–1702. https://doi.org/...
  11. [11] Criteo AI Lab, "Criteo 1TB click logs dataset," https://ailab.criteo.com/criteo-1tb-click-logs-dataset/, accessed: 2026.
  12. [12] H. Feng, B. Zhang, F. Ye, M. Si, C.-H. Chu, J. Tian, C. Yin, S. Deng, Y. Hao, P. Balaji et al., "Accelerating communication in deep learning recommendation model training with dual-level adaptive lossy compression," in SC24: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2024, pp. 1–16.
  13. [13] S. Gao, Y. Chen, and J. Shu, "Fast state restoration in LLM serving with HCache," in Proceedings of the Twentieth European Conference on Computer Systems, 2025, pp. 128–143. https://www.doi.org/10.1145/3689031.3696072
  14. [14] L. Guan, J.-Q. Yang, Z. Zhao, B. Zhang, B. Sun, X. Luo, J. Ni, X. Li, Y. Qi, Z. Fan et al., "Make it long, keep it fast: End-to-end 10k-sequence modeling at billion scale on Douyin," arXiv preprint arXiv:2511.06077, 2025.
  15. [15]
  16. [16] X. Guo, B. Huang, X. Wu, G. Wu, F. Li, S. Wang, Q. Xiao, C. Luo, and Y. Li, "Flame: A serving system optimized for large-scale generative recommendation with efficiency," arXiv preprint arXiv:2509.22681, 2025.
  17. [17]
  18. [18] U. Gupta, C.-J. Wu, X. Wang, M. Naumov, B. Reagen, D. Brooks, B. Cottel, K. Hazelwood, M. Hempstead, B. Jia et al., "The architectural implications of Facebook's DNN-based personalized recommendation," in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2020, pp. 488–501. https://doi.org/10.1109...
  19. [19] Gurobi Optimization, LLC, "Gurobi Optimizer Reference Manual," https://www.gurobi.com/documentation/current/refman/index.html, 2021.
  20. [20] R. Han, B. Yin, S. Chen, H. Jiang, F. Jiang, X. Li, C. Ma, M. Huang, X. Li, C. Jing et al., "MTGR: Industrial-scale generative recommendation framework in Meituan," in Proceedings of the 34th ACM International Conference on Information and Knowledge Management, 2025, pp. 5731–5738. https://www.doi.org/10.1145/3746252.3761565
  21. [21] Y. Hou, J. Li, Z. He, A. Yan, X. Chen, and J. McAuley, "Bridging language and items for retrieval and recommendation," arXiv preprint arXiv:2403.03952, 2024. https://doi.org/10.48550/arXiv.2403.03952
  22. [22] Y. Huang, Y. Chen, X. Cao, R. Yang, M. Qi, Y. Zhu, Q. Han, Y. Liu, Z. Liu, X. Yao et al., "Towards large-scale generative ranking," arXiv preprint arXiv:2505.04180, 2025. https://doi.org/10.48550/arXiv.2505.04180
  23. [23] D. Khudia, J. Huang, P. Basu, S. Deng, H. Liu, J. Park, and M. Smelyanskiy, "FBGEMM: Enabling high-performance low-precision deep learning inference," arXiv preprint arXiv:2101.05615, 2021. https://doi.org/10.48550/arXiv.2101.05615
  24. [24] T. Kim, Y. Lee, J. Lim, and M. Rhu, "A characterization of generative recommendation models: Study of hierarchical sequential transduction unit," IEEE Computer Architecture Letters, 2025. https://doi.org/10.1109/LCA.2025.3546811
  25. [25] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, "Efficient memory management for large language model serving with PagedAttention," in Proceedings of the 29th Symposium on Operating Systems Principles, 2023, pp. 611–626. https://dl.acm.org/doi/10.1145/3600006.3613165
  26. [26] G. Li, H. Tan, C. Zhang, H. Ni, Z. Wang, X. Zhang, Y. Xu, and H. Tian, "Embedding samples dispatching for recommendation model training in edge environments," arXiv preprint arXiv:2512.21615, 2025. https://doi.org/10.48550/arXiv.2512.21615
  27. [27] X. Lian, B. Yuan, X. Zhu, Y. Wang, Y. He, H. Wu, L. Sun, H. Lyu, C. Liu, X. Dong et al., "Persia: An open, hybrid system scaling deep learning-based recommenders up to 100 trillion parameters," in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 3288–3298. https://doi.org/10.1145/3534...
  28. [28] Z. Liang, C. Wu, D. Huang, W. Sun, Z. Wang, Y. Yan, J. Wu, Y. Jiang, B. Zheng, K. Chen et al., "TBGRecall: A generative retrieval model for e-commerce recommendation scenarios," in Proceedings of the 34th ACM International Conference on Information and Knowledge Management, 2025, pp. 5863–5870. https://dl.acm.org/doi/10.1145/3746252.3761557
  29. [29] J. Lim, Y. G. Kim, S. W. Chung, F. Koushanfar, and J. Kong, "Near-memory computing with compressed embedding table for personalized recommendation," IEEE Transactions on Emerging Topics in Computing, vol. 12, pp. 938–951, 2024. https://www.doi.org/10.1109/tetc.2023.3345870
  30. [30] H. Lin, L. Yang, H. Guo, and J. Cao, "Decentralized task offloading in edge computing: An offline-to-online reinforcement learning approach," IEEE Transactions on Computers, vol. 73, pp. 1603–1615, 2024. https://www.doi.org/10.1109/tc.2024.3377912
  31. [31] L. Ma, H. Kassa, Y. Lee, C. Liu, Z. Kong, Z. Jiang, P. Palangappa, R. Brugarolas, X. Cao, M. Hodak, C.-J. Wu, X. Liu, and J. Zhai, "DLRMv3: Generative recommendation benchmark in MLPerf inference," https://mlcommons.org/2026/02/dlrmv3-inference-meta/, Feb. 2026, MLCommons Technical Report.
  32. [32] S. Maity, A. Majumder, R. Roy, A. R. Hota, and S. Dey, "Harnessing machine learning in dynamic thermal management in embedded CPU-GPU platforms," ACM Transactions on Design Automation of Electronic Systems, vol. 30, pp. 1–32, 2024. https://www.doi.org/10.1145/3708890
  33. [33] D. Mudigere, Y. Hao, J. Huang, Z. Jia, A. Tulloch, S. Sridharan, X. Liu, M. Ozdal, J. Nie, J. Park et al., "Software-hardware co-design for fast and scalable training of deep learning recommendation models," in Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022, pp. 993–1011. https://doi.org/10.1145/3...
  34. [34] M. Naumov, D. Mudigere, H.-J. M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U. Gupta, C.-J. Wu, A. G. Azzolini et al., "Deep learning recommendation model for personalization and recommendation systems," arXiv preprint arXiv:1906.00091, 2019. https://doi.org/10.48550/arXiv.1906.00091
  35. [35] NVIDIA Corporation, "NVIDIA Collective Communications Library (NCCL)," https://developer.nvidia.com/nccl, 2017.
  36. [36] NVIDIA Corporation, "NVIDIA Multi-Process Service (MPS) overview," https://docs.nvidia.com/deploy/mps/index.html, 2023, accessed: 2026-03-26.
  37. [37] J. Oh and Y. Kim, "Job placement using reinforcement learning in GPU virtualization environment," Cluster Computing, vol. 23, no. 3, pp. 2219–2234, Sep. 2020. https://doi.org/10.1007/s10586-019-03044-7
  38. [39] R. Qin, Z. Li, W. He, J. Cui, F. Ren, M. Zhang, Y. Wu, W. Zheng, and X. Xu, "Mooncake: Trading more storage for less computation - a KVCache-centric architecture for serving LLM chatbot," in 23rd USENIX Conference on File and Storage Technologies (FAST 25), 2025, pp. 155–170. https://www.usenix.org/conference/fast25/presentation/qin
  39. [40] J. Ren, B. Ma, S. Yang, B. Francis, E. K. Ardestani, M. Si, and D. Li, "Machine learning-guided memory optimization for DLRM inference on tiered memory," in 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2025, pp. 1631–1647. https://doi.org/10.1109/HPCA61900.2025.00121
  40. [41] U. Saroliya, E. Arima, D. Liu, and M. Schulz, "Hierarchical resource partitioning on modern GPUs: A reinforcement learning approach," in 2023 IEEE International Conference on Cluster Computing (CLUSTER), 2023, pp. 185–196. https://www.doi.org/10.1109/cluster52292.2023.00023
  41. [42] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017. https://doi.org/10.48550/arXiv.1707.06347
  42. [43]
  43. [44] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas, "Taking the human out of the loop: A review of Bayesian optimization," Proceedings of the IEEE, vol. 104, no. 1, pp. 148–175, 2015. https://doi.org/10.1109/JPROC.2015.2494218
  44. [45] J. Sun, S. Wang, Z. Zhang, Z. Liu, Y. Xu, P. Sun, B. Zhao, B. He, F. Wu, and Z. Wang, "BAT: Efficient generative recommender serving with bipartite attention," in Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2026. https://doi.org/10.1145/3779212.3790131
  45. [46] J. Sun, B. Reidys, D. Li, J. Chang, M. Snir, and J. Huang, "FleetIO: Managing multi-tenant cloud storage with multi-agent reinforcement learning," in Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 2025, pp. 478–492. https://doi.org/10.1145/3669940.3707229
  46. [47] Q. Sun, T. Liu, S. Zhang, S. Wu, P. Yang, H. Liang, M. Li, X. Ma, Z. Liang, Z. Ren et al., "xGR: Efficient generative recommendation serving at scale," arXiv preprint arXiv:2512.11529, 2025. https://doi.org/10.48550/arXiv.2512.11529
  47. [48] S. Udkar, V. Gupta, M. Reddy, and M. Vaithianathan, "Dynamic resource allocation of virtualized GPUs and CPUs for scalable AI workloads in containerized environments," in 2025 5th International Conference on Expert Clouds and Applications (ICOECA). IEEE, 2025, pp. 432–437. https://www.doi.org/10.1109/icoeca66273.2025.00080
  48. [49] J. Wang, H. Chai, Y. Zhang, Z. Zhou, W. Guo, X. Yang, Q. Tang, B. Pan, J. Zhu, K. Cheng et al., "RelayGR: Scaling long-sequence generative recommendation via cross-stage relay-race inference," arXiv preprint arXiv:2601.01712, 2026. https://doi.org/10.48550/arXiv.2601.01712
  49. [50] W. Wei, H. Gu, K. Wang, J. Li, X. Zhang, and N. Wang, "Multi-dimensional resource allocation in distributed data centers using deep reinforcement learning," IEEE Transactions on Network and Service Management, vol. 20, pp. 1817–1829, 2023. https://www.doi.org/10.1109/tnsm.2022.3213575
  50. [51] L. Yang, Y. Wang, Y. Yu, Q. Weng, J. Dong, K. Liu, C. Zhang, Y. Zi, H. Li, Z. Zhang et al., "GPU-disaggregated serving for deep learning recommendation models at scale," in 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25), 2025, pp. 847–863. https://www.usenix.org/conference/nsdi25/presentation/yang
  51. [52] G.-I. Yu, J. S. Jeong, G.-W. Kim, S. Kim, and B.-G. Chun, "Orca: A distributed serving system for transformer-based generative models," in 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 2022, pp. 521–538. https://www.usenix.org/conference/osdi22/presentation/yu
  52. [53] W. Yu, S. Chen, C. Chen, and A. C. Zhou, "Near-zero-overhead freshness for recommendation systems via inference-side model updates," in Proceedings of the 32nd IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2026. https://doi.org/10.1109/HPCA68181.2026.11408549
  53. [54] Y. Yu, X. Si, C. Hu, and J. Zhang, "A review of recurrent neural networks: LSTM cells and network architectures," Neural Computation, vol. 31, no. 7, pp. 1235–1270, 2019. https://www.doi.org/10.1162/neco_a_01199
  54. [55] D. Zha, L. Feng, B. Bhushanam, D. Choudhary, J. Nie, Y. Tian, J. Chae, Y. Ma, A. Kejariwal, and X. Hu, "AutoShard: Automated embedding table sharding for recommender systems," in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 4461–4471. https://www.doi.org/10.1145/3534678.3539034
  55. [56] J. Zhai, L. Liao, X. Liu, Y. Wang, R. Li, X. Cao, L. Gao, Z. Gong, F. Gu, J. He et al., "Actions speak louder than words: Trillion-parameter sequential transducers for generative recommendations," in Proceedings of the 41st International Conference on Machine Learning, 2024, pp. 58484–58509. https://dl.acm.org/doi/10.5555/3692070.3694484
  56. [57] M. Zhao, D. Choudhary, D. Tyagi, A. Somani, M. Kaplan, S.-H. Lin, S. Pumma, J. Park, A. Basant, N. Agarwal et al., "RecD: Deduplication for end-to-end deep learning recommendation model training infrastructure," Proceedings of Machine Learning and Systems, vol. 5, pp. 754–767, 2023. https://proceedings.mlsys.org/paper_files/paper/...
  57. [58] G. Zhou, H. Hu, H. Cheng, H. Wang, J. Deng, J. Zhang, K. Cai, L. Ren, L. Ren, L. Yu et al., "OneRec-V2 technical report," arXiv preprint arXiv:2508.20900, 2025. https://doi.org/10.48550/arXiv.2508.20900
  58. [59] Y. Zhou, C. Guo, K. Cai, J. Liu, Q. Luo, R. Tang, H. Li, K. Gai, and G. Zhou, "GEMS: Breaking the long-sequence barrier in generative recommendation with a multi-stream decoder," arXiv preprint arXiv:2602.13631, 2026. https://doi.org/10.48550/arXiv.2602.13631