pith. machine review for the scientific record.

arxiv: 2605.04450 · v1 · submitted 2026-05-06 · 💻 cs.DC · cs.IR · cs.LG

Recognition: 3 theorem links

One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving

Amelie Chi Zhou, Shuguang Han, Wenjun Yu

Pith reviewed 2026-05-08 17:42 UTC · model grok-4.3

classification 💻 cs.DC · cs.IR · cs.LG
keywords generative recommender · HBM partitioning · KV cache · embedding cache · adaptive allocation · PPO controller · request scheduling · GPU inference serving

The pith

HELM dynamically partitions GPU HBM between embedding and KV caches, cutting P99 latency by 24-38% while sustaining 93.5-99.6% SLO satisfaction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Generative recommender inference requires both embedding hot caches and KV caches to reside in limited GPU high-bandwidth memory, yet the allocation ratio that minimizes latency shifts by as much as 0.35 across steady, trend, and burst regimes. Static partitions therefore leave 20-30% of the achievable latency improvement on the table, while naive reallocation moves data over the PCIe bus on the critical path and triggers SLO violations. HELM addresses this by running a low-latency PPO controller that continuously adjusts the split, and by routing requests according to current KV residency, embedding locality, and node load. On three production-scale datasets across a 32-node A100 cluster, the system delivers the reported gains without throughput loss.
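To see why a single static split cannot win everywhere, consider a toy cost model of the trade-off. Everything below is invented for illustration (the constants, hit-rate curves, and miss costs are not from the paper); it only shows how the latency-minimizing EMB share α can move as the workload mix drifts:

```python
# Toy illustration only: invented constants, not the paper's cost model.
# alpha is the EMB share of HBM; the remainder (1 - alpha) holds KV cache.
def toy_latency(alpha: float, hot_user_share: float) -> float:
    emb_hit = min(1.0, alpha / 0.6)             # EMB hit rate saturates with capacity
    kv_hit = min(1.0, (1 - alpha) / 0.7)        # KV hit rate saturates likewise
    emb_miss_cost = 30.0 * hot_user_share       # ms penalty for host-side embedding fetch
    kv_miss_cost = 45.0 * (1 - hot_user_share)  # ms penalty for KV refill/recompute
    return (1 - emb_hit) * emb_miss_cost + (1 - kv_hit) * kv_miss_cost

def best_alpha(hot_user_share: float) -> float:
    grid = [i / 100 for i in range(101)]
    return min(grid, key=lambda a: toy_latency(a, hot_user_share))

# As the hot-user share drifts from 0.8 to 0.3, the toy optimum moves from
# 0.60 to 0.30, echoing the reported ratio shifts of up to 0.35 across regimes.
print(best_alpha(0.8), best_alpha(0.3))
```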

Core claim

HELM jointly manages HBM allocation and request routing at runtime through two components: Adaptive Memory Allocation, a three-layer PPO controller (frozen base policy, online residual adapter, burst-aware recovery) that makes decisions in 32 μs and stays within 0.024-0.029 of the offline-optimal EMB-KV ratio, and EMB-KV-Aware Scheduling, which routes requests by jointly considering KV residency, embedding locality, and node load. This combination reduces P99 latency by 24-38% versus the best static policy and achieves 93.5-99.6% SLO satisfaction across Steady, Trend, and Burst workloads.

What carries the argument

The three-layer PPO controller (frozen base policy, online residual adapter, and burst-aware recovery controller) that tracks the optimal EMB-KV memory ratio at 32 μs latency without introducing H2D refill traffic.
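The review names the three layers but carries no pseudocode. Here is a minimal sketch of how such a stack could compose; the blend rule, clamp bounds, burst trigger, and all names are our assumptions, not the authors' implementation:

```python
import numpy as np

class ThreeLayerAllocator:
    """Hedged sketch of the described stack: frozen PPO base policy, online
    residual adapter, burst-aware recovery. The composition below is an
    assumption about how the layers could interact, not the paper's code."""

    def __init__(self, base_policy, residual_adapter, burst_threshold=3.0):
        self.base = base_policy            # frozen PPO policy: state -> alpha
        self.residual = residual_adapter   # small correction head, updated online
        self.burst_threshold = burst_threshold
        self.prev_alpha = 0.5              # last applied EMB share

    def decide(self, state: np.ndarray, request_rate_z: float) -> float:
        alpha = self.base(state)           # coarse split from the frozen policy
        alpha += self.residual(state)      # online correction for workload drift
        if request_rate_z > self.burst_threshold:
            # Burst-aware recovery: damp toward the last applied split so a
            # traffic spike cannot trigger a large, refill-heavy reallocation.
            alpha = 0.5 * alpha + 0.5 * self.prev_alpha
        alpha = float(np.clip(alpha, 0.05, 0.95))  # keep both caches nonempty
        self.prev_alpha = alpha
        return alpha                       # EMB share; KV cache gets 1 - alpha
```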

If this is right

  • Joint allocation and routing decisions eliminate routing inefficiencies that appear when caches are managed separately under heterogeneous node states (a routing-score sketch follows this list).
  • The controller maintains throughput parity while cutting tail latency because it avoids PCIe data movement on the serving path.
  • Production-scale evaluation on three datasets over 32 A100 nodes confirms the gains hold across steady, trending, and bursty request patterns.
  • The same framework applies to any serving system in which two or more cache types compete for a single fast memory pool.
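The routing claim above lends itself to a concrete rule. A minimal sketch of one way to fold KV residency, embedding locality, and node load into a single per-node routing score (the linear weighting, weight values, and field names are our assumptions; the paper's cost model may differ):

```python
from dataclasses import dataclass

@dataclass
class NodeState:
    kv_resident_frac: float  # fraction of this request's KV blocks already on the node
    emb_hit_est: float       # estimated embedding-cache hit rate for this request
    load: float              # normalized queue depth / utilization in [0, 1]

def routing_score(n: NodeState, w_kv=0.4, w_emb=0.4, w_load=0.2) -> float:
    # Higher is better: prefer nodes already holding the KV state and the hot
    # embeddings, and penalize loaded nodes. Weights are illustrative only.
    return w_kv * n.kv_resident_frac + w_emb * n.emb_hit_est - w_load * n.load

def route(nodes: list[NodeState]) -> int:
    # Send the request to the node with the best combined score.
    return max(range(len(nodes)), key=lambda i: routing_score(nodes[i]))
```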

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same low-overhead controller pattern could be reused for other competing GPU memory structures such as activation caches and parameter shards in large-model inference.
  • Because the optimal ratio changes with workload statistics, similar adaptive partitioning may become necessary as model sizes and context lengths continue to grow.
  • Extending the burst-aware recovery layer to incorporate short-term traffic predictions could further shrink the residual tracking error below 0.02.

Load-bearing premise

The three-layer PPO controller can track the shifting optimal EMB-KV ratio within 0.024-0.029 while keeping 32 μs decision latency and avoiding H2D refill traffic on the critical path under real production workload shifts.

What would settle it

Re-run the same Steady, Trend, and Burst workloads with the controller frozen at a static ratio: if the frozen system's P99 improvement drops below 10% or its SLO satisfaction falls below 90%, the adaptive controller is demonstrably carrying the reported gains.
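For concreteness, a minimal sketch of the metrics such a freeze-the-controller ablation would report, assuming raw per-request latency samples and a single SLO threshold (both hypothetical inputs):

```python
def p99(latencies_ms: list[float]) -> float:
    # Nearest-rank 99th-percentile latency over one run's samples.
    s = sorted(latencies_ms)
    return s[max(0, int(0.99 * len(s)) - 1)]

def slo_satisfaction(latencies_ms: list[float], slo_ms: float) -> float:
    # Fraction of requests that finish within the SLO.
    return sum(l <= slo_ms for l in latencies_ms) / len(latencies_ms)

def p99_improvement(static_run: list[float], adaptive_run: list[float]) -> float:
    # Relative P99 reduction of adaptive serving over the frozen-ratio run.
    return 1.0 - p99(adaptive_run) / p99(static_run)
```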

Figures

Figures reproduced from arXiv: 2605.04450 by Amelie Chi Zhou, Shuguang Han, Wenjun Yu.

Figure 1: HSTU inference latency under different settings
Figure 2: HSTU Model Structure
Figure 3: HSTU Online Inference Serving Architecture
Figure 4: Normalized latency vs. EMB–KV allocation ratio
Figure 5: Rapid production fluctuations in hot-user request share and mean sequence length
Figure 7: Overall architecture of HELM
Figure 9: HELM overlaps EMB H2D transfers and throttles background reallocation traffic
Figure 11: Overall P99 latency of three methods across three workloads
Figure 14: Allocator-internal ablation study
Figure 13: System-level ablation of HELM
Figure 15: RL-controlled αRL versus offline-optimal α* over a four-hour window (Amazon-Books). Scatter points denote hot-user request regimes: Steady, Trend, and Burst
Figure 16: Average αRL under varying cluster scale and hot-request share. Color denotes regime: All-EMB, Split, All-KV
read the original abstract

Generative Recommender (GR) inference places embedding hot caches (EMB) and KV caches in direct competition for limited GPU HBM: allocating more memory to one improves its efficiency but degrades the other. Existing systems optimize them in isolation, overlooking that the optimal EMB-KV allocation ratio can shift by up to 0.35 across workload regimes, leaving 20-30% latency improvement unrealized. While online reallocation is required to close this gap, naive approaches introduce H2D refill traffic on the critical path, causing P99 SLO violations. To address this, we present HELM, which jointly manages HBM allocation and request routing at runtime through two key components: (1) Adaptive Memory Allocation, a three-layer PPO-based controller (frozen base policy, online residual adapter, and burst-aware recovery controller) that achieves 32 μs decision latency while staying within 0.024-0.029 of the offline-optimal ratio; and (2) EMB-KV-Aware Scheduling, which routes requests by jointly considering KV residency, embedding locality, and node load to avoid routing inefficiencies under heterogeneous allocations. Evaluations on three production-scale datasets over a 32-node A100 cluster show that HELM reduces P99 latency by 24-38% over the best static policy and achieves 93.5-99.6% SLO satisfaction across Steady, Trend, and Burst workloads, significantly outperforming state-of-the-art baselines without sacrificing throughput.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents HELM, a system for generative recommender serving that jointly manages HBM allocation between embedding (EMB) caches and KV caches via a three-layer PPO controller (frozen base, online residual adapter, burst-aware recovery) achieving 32 μs decisions and 0.024-0.029 deviation from offline-optimal ratios, combined with EMB-KV-aware request scheduling. On three production datasets over a 32-node A100 cluster, it reports 24-38% P99 latency reduction versus the best static policy and 93.5-99.6% SLO satisfaction across Steady/Trend/Burst workloads, outperforming baselines without throughput loss.

Significance. If the adaptive control and scheduling deliver the claimed overhead-free tracking, the work addresses a genuine and previously under-optimized tension in HBM resource allocation for large-scale GR inference, where optimal EMB-KV ratios can shift by up to 0.35. The production-scale empirical evaluation on real hardware provides concrete, falsifiable latency and SLO numbers that could influence practical serving systems.

major comments (3)
  1. [Abstract and evaluation] Abstract and evaluation sections: the headline 24-38% P99 reduction and 93.5-99.6% SLO claims rest on the three-layer PPO controller tracking the optimal EMB-KV ratio within 0.024-0.029 at 32 μs latency without introducing H2D refill traffic. No histograms of per-decision latency, time-series of controller error, or ablation isolating the residual adapter and burst-recovery components are provided, making it impossible to confirm that the observed gains are attributable to adaptive partitioning rather than other system factors.
  2. [Evaluation] Evaluation on 32-node A100 cluster: the reported results lack error bars, detailed characterization of how the Steady/Trend/Burst traces induce ratio shifts, and implementation specifics for the static-policy and SOTA baselines. This weakens verification of the cross-workload robustness claims.
  3. [EMB-KV-Aware Scheduling] EMB-KV-Aware Scheduling description: it is stated that the policy routes requests considering KV residency, embedding locality, and node load to avoid inefficiencies under heterogeneous allocations, but no quantitative breakdown of refill volume or critical-path H2D traffic under the measured controller deviations is given, leaving the central 'no refill on critical path' assumption unverified.
minor comments (2)
  1. [Abstract] The abstract and text would benefit from explicit workload characterization (e.g., request-rate distributions or embedding-access patterns) to contextualize the 0.35 ratio-shift claim.
  2. [Adaptive Memory Allocation] Notation for the three-layer controller components is introduced without a diagram or pseudocode; a small figure would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight opportunities to strengthen the empirical support for our claims regarding the three-layer PPO controller and scheduling policy. We will incorporate additional visualizations, error bars, workload characterizations, and quantitative breakdowns in the revised manuscript. Below we respond point-by-point to each major comment.

read point-by-point responses
  1. Referee: [Abstract and evaluation] Abstract and evaluation sections: the headline 24-38% P99 reduction and 93.5-99.6% SLO claims rest on the three-layer PPO controller tracking the optimal EMB-KV ratio within 0.024-0.029 at 32 μs latency without introducing H2D refill traffic. No histograms of per-decision latency, time-series of controller error, or ablation isolating the residual adapter and burst-recovery components are provided, making it impossible to confirm that the observed gains are attributable to adaptive partitioning rather than other system factors.

    Authors: We agree that the current presentation leaves the attribution of gains to the adaptive controller less direct than ideal. In the revision we will add: (1) histograms of per-decision controller latency across all workloads, (2) time-series plots of instantaneous controller error relative to the offline-optimal ratio, and (3) an ablation study that isolates the residual adapter and burst-aware recovery layers. These additions will show that the 0.024-0.029 tracking accuracy is achieved at 32 μs and directly correlates with the reported P99 reductions, while the static baselines exhibit larger deviations and higher latency. revision: yes

  2. Referee: [Evaluation] Evaluation on 32-node A100 cluster: the reported results lack error bars, detailed characterization of how the Steady/Trend/Burst traces induce ratio shifts, and implementation specifics for the static-policy and SOTA baselines. This weakens verification of the cross-workload robustness claims.

    Authors: We will add error bars (standard deviation over five independent runs) to all latency and SLO figures. We will also expand the evaluation section with a characterization subsection that plots the time-varying optimal EMB-KV ratio for each trace and quantifies the magnitude and frequency of shifts (up to 0.35). Finally, we will include a table or appendix entry detailing the exact configurations and hyper-parameters used for the static policies and SOTA baselines, enabling exact reproduction. revision: yes

  3. Referee: [EMB-KV-Aware Scheduling] EMB-KV-Aware Scheduling description: it is stated that the policy routes requests considering KV residency, embedding locality, and node load to avoid inefficiencies under heterogeneous allocations, but no quantitative breakdown of refill volume or critical-path H2D traffic under the measured controller deviations is given, leaving the central 'no refill on critical path' assumption unverified.

    Authors: We will add a new figure and accompanying text that reports measured H2D refill volumes and the fraction of traffic on the critical path, broken down by workload and under the observed controller deviations of 0.024-0.029. The data will demonstrate that the EMB-KV-aware scheduler keeps critical-path refills below 1% of total H2D traffic, confirming that the small tracking error does not translate into SLO violations. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical system evaluation with direct measurements

full rationale

The paper presents HELM as an empirical systems contribution: a three-layer PPO controller for runtime HBM partitioning between EMB and KV caches, combined with EMB-KV-aware scheduling. Central claims (24-38% P99 latency reduction, 93.5-99.6% SLO satisfaction) rest on cluster measurements across Steady/Trend/Burst workloads on three production datasets. No derivation chain, equations, or first-principles predictions exist that reduce to fitted inputs or self-citations by construction. The controller's tracking bounds (0.024-0.029) and 32 μs latency are stated as design targets and validated via end-to-end runs rather than derived tautologically. Any self-citations (if present in full text) are not load-bearing for the reported results, which are externally falsifiable via the described 32-node A100 experiments.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim depends on the validity of observed workload shifts and the effectiveness of the learned controller; the PPO components introduce fitted parameters while the shift magnitude is treated as an observed domain fact.

free parameters (1)
  • PPO residual adapter and recovery controller parameters
    The online adapter and burst-aware recovery are trained or tuned components whose exact values are not stated but are required for the 32 μs decision latency and 0.024-0.029 closeness claims.
axioms (1)
  • domain assumption · Optimal EMB-KV allocation ratio shifts by up to 0.35 across workload regimes
    Presented as a key motivating observation that existing isolated optimizations leave 20-30% latency improvement unrealized.
invented entities (2)
  • Three-layer PPO controller (frozen base + online residual adapter + burst-aware recovery) · no independent evidence
    purpose: Achieve low-latency adaptive HBM allocation
    New controller architecture introduced to manage the allocation without critical-path overhead.
  • EMB-KV-Aware Scheduling policy · no independent evidence
    purpose: Route requests considering KV residency, embedding locality, and node load
    New scheduling component to avoid inefficiencies under heterogeneous allocations.

pith-pipeline@v0.9.0 · 5581 in / 1457 out tokens · 88922 ms · 2026-05-08T17:42:00.026841+00:00 · methodology


Reference graph

Works this paper leans on

58 extracted references · 44 canonical work pages · 6 internal anchors

  1. [1] "Avazu click-through rate prediction," https://www.kaggle.com/c/avazu-ctr-prediction
  2. [2] "KDD Cup 2012 Track 2: Predicting clicks on advertisements," https://www.kdd.org/kdd-cup/view/kdd-cup-2012-track-2
  3. [3] "Meta synthetic datasets for embedding lookup workloads," https://github.com/facebookresearch/dlrm_datasets
  4. [4] "Taobao user behavior dataset," https://tianchi.aliyun.com/dataset/dataDetail?dataId=649
  5. [5] K. J. Åström and T. Hägglund, PID Controllers: Theory, Design and Tuning, 2nd ed. Research Triangle Park, NC: Instrument Society of America, 1995. https://portal.research.lu.se/en/publications/pid-controllers-theory-design-and-tuning
  6. [6] Z. Chai, Q. Ren, X. Xiao, H. Yang, B. Han, S. Zhang, D. Chen, H. Lu, W. Zhao, L. Yu et al., "Longer: Scaling up long sequence modeling in industrial recommenders," in Proceedings of the Nineteenth ACM Conference on Recommender Systems, 2025, pp. 247–256. https://doi.org/10.1145/3705328.3748065
  7. [7] P. Chen, J. Zhang, H. Zhao, Y. Zhang, J. Yu, X. Tang, Y. Wang, H. Li, J. Zou, G. Xiong et al., "Toward robust and efficient ML-based GPU caching for modern inference," arXiv preprint arXiv:2509.20979, 2025. https://doi.org/10.48550/arXiv.2509.20979
  8. [8] R. Chen, J. Wu, H. Shi, Y. Li, X. Liu, and G. Wang, "DRLPart: A deep reinforcement learning framework for optimally efficient and robust resource partitioning on commodity servers," in Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing, 2021, pp. 175–188. https://doi.org/10.1145/3431...
  9. [9] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794. https://doi.org/10.1145/2939672.2939785
  10. [10] J. Coburn, C. Tang, S. A. Asal, N. Agrawal, R. Chinta, H. Dixit, B. Dodds, S. Dwarakapuram, A. Firoozshahian, C. Gao et al., "Meta's second generation AI chip: Model-chip co-design and productionization experiences," in Proceedings of the 52nd Annual International Symposium on Computer Architecture, 2025, pp. 1689–1702. https://doi.org/...
  11. [11] Criteo AI Lab, "Criteo 1TB click logs dataset," https://ailab.criteo.com/criteo-1tb-click-logs-dataset/, accessed: 2026.
  12. [12] H. Feng, B. Zhang, F. Ye, M. Si, C.-H. Chu, J. Tian, C. Yin, S. Deng, Y. Hao, P. Balaji et al., "Accelerating communication in deep learning recommendation model training with dual-level adaptive lossy compression," in SC24: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2024, pp. 1–16.
  13. [13] S. Gao, Y. Chen, and J. Shu, "Fast state restoration in LLM serving with HCache," in Proceedings of the Twentieth European Conference on Computer Systems, 2025, pp. 128–143. https://www.doi.org/10.1145/3689031.3696072
  14. [14] L. Guan, J.-Q. Yang, Z. Zhao, B. Zhang, B. Sun, X. Luo, J. Ni, X. Li, Y. Qi, Z. Fan et al., "Make it long, keep it fast: End-to-end 10k-sequence modeling at billion scale on Douyin," arXiv preprint arXiv:2511.06077, 2025.
  15. [15]
  16. [16] X. Guo, B. Huang, X. Wu, G. Wu, F. Li, S. Wang, Q. Xiao, C. Luo, and Y. Li, "Flame: A serving system optimized for large-scale generative recommendation with efficiency," arXiv preprint arXiv:2509.22681, 2025.
  17. [17]
  18. [18] U. Gupta, C.-J. Wu, X. Wang, M. Naumov, B. Reagen, D. Brooks, B. Cottel, K. Hazelwood, M. Hempstead, B. Jia et al., "The architectural implications of Facebook's DNN-based personalized recommendation," in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2020, pp. 488–501. https://doi.org/10.1109...
  19. [19] Gurobi Optimization, LLC, "Gurobi Optimizer Reference Manual," https://www.gurobi.com/documentation/current/refman/index.html, 2021.
  20. [20] R. Han, B. Yin, S. Chen, H. Jiang, F. Jiang, X. Li, C. Ma, M. Huang, X. Li, C. Jing et al., "MTGR: Industrial-scale generative recommendation framework in Meituan," in Proceedings of the 34th ACM International Conference on Information and Knowledge Management, 2025, pp. 5731–5738. https://www.doi.org/10.1145/3746252.3761565
  21. [21] Y. Hou, J. Li, Z. He, A. Yan, X. Chen, and J. McAuley, "Bridging language and items for retrieval and recommendation," arXiv preprint arXiv:2403.03952, 2024. https://doi.org/10.48550/arXiv.2403.03952
  22. [22] Y. Huang, Y. Chen, X. Cao, R. Yang, M. Qi, Y. Zhu, Q. Han, Y. Liu, Z. Liu, X. Yao et al., "Towards large-scale generative ranking," arXiv preprint arXiv:2505.04180, 2025. https://doi.org/10.48550/arXiv.2505.04180
  23. [23] D. Khudia, J. Huang, P. Basu, S. Deng, H. Liu, J. Park, and M. Smelyanskiy, "FBGEMM: Enabling high-performance low-precision deep learning inference," arXiv preprint arXiv:2101.05615, 2021. https://doi.org/10.48550/arXiv.2101.05615
  24. [24] T. Kim, Y. Lee, J. Lim, and M. Rhu, "A characterization of generative recommendation models: Study of hierarchical sequential transduction unit," IEEE Computer Architecture Letters, 2025. https://doi.org/10.1109/LCA.2025.3546811
  25. [25] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, "Efficient memory management for large language model serving with PagedAttention," in Proceedings of the 29th Symposium on Operating Systems Principles, 2023, pp. 611–626. https://dl.acm.org/doi/10.1145/3600006.3613165
  26. [26] G. Li, H. Tan, C. Zhang, H. Ni, Z. Wang, X. Zhang, Y. Xu, and H. Tian, "Embedding samples dispatching for recommendation model training in edge environments," arXiv preprint arXiv:2512.21615, 2025. https://doi.org/10.48550/arXiv.2512.21615
  27. [27] X. Lian, B. Yuan, X. Zhu, Y. Wang, Y. He, H. Wu, L. Sun, H. Lyu, C. Liu, X. Dong et al., "Persia: An open, hybrid system scaling deep learning-based recommenders up to 100 trillion parameters," in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 3288–3298. https://doi.org/10.1145/3534...
  28. [28] Z. Liang, C. Wu, D. Huang, W. Sun, Z. Wang, Y. Yan, J. Wu, Y. Jiang, B. Zheng, K. Chen et al., "TBGRecall: A generative retrieval model for e-commerce recommendation scenarios," in Proceedings of the 34th ACM International Conference on Information and Knowledge Management, 2025, pp. 5863–5870. https://dl.acm.org/doi/10.1145/3746252.3761557
  29. [29] J. Lim, Y. G. Kim, S. W. Chung, F. Koushanfar, and J. Kong, "Near-memory computing with compressed embedding table for personalized recommendation," IEEE Transactions on Emerging Topics in Computing, vol. 12, pp. 938–951, 2024. https://www.doi.org/10.1109/tetc.2023.3345870
  30. [30] H. Lin, L. Yang, H. Guo, and J. Cao, "Decentralized task offloading in edge computing: An offline-to-online reinforcement learning approach," IEEE Transactions on Computers, vol. 73, pp. 1603–1615, 2024. https://www.doi.org/10.1109/tc.2024.3377912
  31. [31] L. Ma, H. Kassa, Y. Lee, C. Liu, Z. Kong, Z. Jiang, P. Palangappa, R. Brugarolas, X. Cao, M. Hodak, C.-J. Wu, X. Liu, and J. Zhai, "DLRMv3: Generative recommendation benchmark in MLPerf inference," https://mlcommons.org/2026/02/dlrmv3-inference-meta/, Feb. 2026, MLCommons Technical Report.
  32. [32] S. Maity, A. Majumder, R. Roy, A. R. Hota, and S. Dey, "Harnessing machine learning in dynamic thermal management in embedded CPU-GPU platforms," ACM Transactions on Design Automation of Electronic Systems, vol. 30, pp. 1–32, 2024. https://www.doi.org/10.1145/3708890
  33. [33] D. Mudigere, Y. Hao, J. Huang, Z. Jia, A. Tulloch, S. Sridharan, X. Liu, M. Ozdal, J. Nie, J. Park et al., "Software-hardware co-design for fast and scalable training of deep learning recommendation models," in Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022, pp. 993–1011. https://doi.org/10.1145/3...
  34. [34] M. Naumov, D. Mudigere, H.-J. M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U. Gupta, C.-J. Wu, A. G. Azzolini et al., "Deep learning recommendation model for personalization and recommendation systems," arXiv preprint arXiv:1906.00091, 2019. https://doi.org/10.48550/arXiv.1906.00091
  35. [35] NVIDIA Corporation, "NVIDIA Collective Communications Library (NCCL)," https://developer.nvidia.com/nccl, 2017.
  36. [36] NVIDIA Corporation, "NVIDIA Multi-Process Service (MPS) overview," https://docs.nvidia.com/deploy/mps/index.html, 2023, accessed: 2026-03-26.
  37. [37] J. Oh and Y. Kim, "Job placement using reinforcement learning in GPU virtualization environment," Cluster Computing, vol. 23, no. 3, pp. 2219–2234, Sep. 2020. https://doi.org/10.1007/s10586-019-03044-7
  38. [39] R. Qin, Z. Li, W. He, J. Cui, F. Ren, M. Zhang, Y. Wu, W. Zheng, and X. Xu, "Mooncake: Trading more storage for less computation - a KVCache-centric architecture for serving LLM chatbot," in 23rd USENIX Conference on File and Storage Technologies (FAST 25), 2025, pp. 155–170. https://www.usenix.org/conference/fast25/presentation/qin
  39. [40] J. Ren, B. Ma, S. Yang, B. Francis, E. K. Ardestani, M. Si, and D. Li, "Machine learning-guided memory optimization for DLRM inference on tiered memory," in 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2025, pp. 1631–1647. https://doi.org/10.1109/HPCA61900.2025.00121
  40. [41] U. Saroliya, E. Arima, D. Liu, and M. Schulz, "Hierarchical resource partitioning on modern GPUs: A reinforcement learning approach," in 2023 IEEE International Conference on Cluster Computing (CLUSTER), 2023, pp. 185–196. https://www.doi.org/10.1109/cluster52292.2023.00023
  41. [42] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017. https://doi.org/10.48550/arXiv.1707.06347
  42. [43]
  43. [44] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas, "Taking the human out of the loop: A review of Bayesian optimization," Proceedings of the IEEE, vol. 104, no. 1, pp. 148–175, 2015. https://doi.org/10.1109/JPROC.2015.2494218
  44. [45] J. Sun, S. Wang, Z. Zhang, Z. Liu, Y. Xu, P. Sun, B. Zhao, B. He, F. Wu, and Z. Wang, "BAT: Efficient generative recommender serving with bipartite attention," in Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2026. https://doi.org/10.1145/3779212.3790131
  45. [46] J. Sun, B. Reidys, D. Li, J. Chang, M. Snir, and J. Huang, "FleetIO: Managing multi-tenant cloud storage with multi-agent reinforcement learning," in Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 2025, pp. 478–492. https://doi.org/10.1145/3669940.3707229
  46. [47] Q. Sun, T. Liu, S. Zhang, S. Wu, P. Yang, H. Liang, M. Li, X. Ma, Z. Liang, Z. Ren et al., "xGR: Efficient generative recommendation serving at scale," arXiv preprint arXiv:2512.11529, 2025. https://doi.org/10.48550/arXiv.2512.11529
  47. [48] S. Udkar, V. Gupta, M. Reddy, and M. Vaithianathan, "Dynamic resource allocation of virtualized GPUs and CPUs for scalable AI workloads in containerized environments," in 2025 5th International Conference on Expert Clouds and Applications (ICOECA). IEEE, 2025, pp. 432–437. https://www.doi.org/10.1109/icoeca66273.2025.00080
  48. [49] J. Wang, H. Chai, Y. Zhang, Z. Zhou, W. Guo, X. Yang, Q. Tang, B. Pan, J. Zhu, K. Cheng et al., "RelayGR: Scaling long-sequence generative recommendation via cross-stage relay-race inference," arXiv preprint arXiv:2601.01712, 2026. https://doi.org/10.48550/arXiv.2601.01712
  49. [50] W. Wei, H. Gu, K. Wang, J. Li, X. Zhang, and N. Wang, "Multi-dimensional resource allocation in distributed data centers using deep reinforcement learning," IEEE Transactions on Network and Service Management, vol. 20, pp. 1817–1829, 2023. https://www.doi.org/10.1109/tnsm.2022.3213575
  50. [51] L. Yang, Y. Wang, Y. Yu, Q. Weng, J. Dong, K. Liu, C. Zhang, Y. Zi, H. Li, Z. Zhang et al., "GPU-disaggregated serving for deep learning recommendation models at scale," in 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25), 2025, pp. 847–863. https://www.usenix.org/conference/nsdi25/presentation/yang
  51. [52] G.-I. Yu, J. S. Jeong, G.-W. Kim, S. Kim, and B.-G. Chun, "Orca: A distributed serving system for transformer-based generative models," in 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 2022, pp. 521–538. https://www.usenix.org/conference/osdi22/presentation/yu
  52. [53] W. Yu, S. Chen, C. Chen, and A. C. Zhou, "Near-zero-overhead freshness for recommendation systems via inference-side model updates," in Proceedings of the 32nd IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2026. https://doi.org/10.1109/HPCA68181.2026.11408549
  53. [54] Y. Yu, X. Si, C. Hu, and J. Zhang, "A review of recurrent neural networks: LSTM cells and network architectures," Neural Computation, vol. 31, no. 7, pp. 1235–1270, 2019. https://www.doi.org/10.1162/neco_a_01199
  54. [55] D. Zha, L. Feng, B. Bhushanam, D. Choudhary, J. Nie, Y. Tian, J. Chae, Y. Ma, A. Kejariwal, and X. Hu, "AutoShard: Automated embedding table sharding for recommender systems," in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 4461–4471. https://www.doi.org/10.1145/3534678.3539034
  55. [56] J. Zhai, L. Liao, X. Liu, Y. Wang, R. Li, X. Cao, L. Gao, Z. Gong, F. Gu, J. He et al., "Actions speak louder than words: Trillion-parameter sequential transducers for generative recommendations," in Proceedings of the 41st International Conference on Machine Learning, 2024, pp. 58484–58509. https://dl.acm.org/doi/10.5555/3692070.3694484
  56. [57] M. Zhao, D. Choudhary, D. Tyagi, A. Somani, M. Kaplan, S.-H. Lin, S. Pumma, J. Park, A. Basant, N. Agarwal et al., "RecD: Deduplication for end-to-end deep learning recommendation model training infrastructure," Proceedings of Machine Learning and Systems, vol. 5, pp. 754–767, 2023. https://proceedings.mlsys.org/paper_files/paper/...
  57. [58] G. Zhou, H. Hu, H. Cheng, H. Wang, J. Deng, J. Zhang, K. Cai, L. Ren, L. Ren, L. Yu et al., "OneRec-V2 technical report," arXiv preprint arXiv:2508.20900, 2025. https://doi.org/10.48550/arXiv.2508.20900
  58. [59] Y. Zhou, C. Guo, K. Cai, J. Liu, Q. Luo, R. Tang, H. Li, K. Gai, and G. Zhou, "GEMS: Breaking the long-sequence barrier in generative recommendation with a multi-stream decoder," arXiv preprint arXiv:2602.13631, 2026. https://doi.org/10.48550/arXiv.2602.13631