xGR: Efficient Generative Recommendation Serving at Scale

Depei Qian; Hailong Yang; Haotian Liang; Ke Zhang; Menxin Li; Minchao Zhang; Peijun Yang; Qingxiao Sun; Shen Zhang; Siyu Wu

arxiv: 2512.11529 · v3 · pith:DQEH5OPRnew · submitted 2025-12-12 · 💻 cs.LG

Qingxiao Sun, Tongxuan Liu, Shen Zhang, Siyu Wu, Peijun Yang, Haotian Liang, Menxin Li, Xiaolong Ma, Zhiwei Liang, Ziyi Ren, Minchao Zhang, Yifan Wang, Xinyu Liu, Ke Zhang, Hailong Yang, Depei Qian

classification 💻 cs.LG

keywords recommendation, serving, beam, decode, generative, item, long, sorting

Recommendation systems deliver substantial economic benefits by providing personalized predictions. Generative recommendation (GR) integrates LLMs to enhance the understanding of long user-item sequences. Despite employing attention-based architectures, GR's workload differs markedly from that of LLM serving. GR typically processes long prompts while producing short, fixed-length outputs, yet the computational cost of each decode phase is especially high due to the large beam width. Furthermore, since beam search involves a vast item space, the sorting overhead becomes particularly time-consuming. We propose xGR, a GR-oriented serving system that meets strict low-latency requirements under high-concurrency scenarios. First, xGR unifies the processing of prefill and decode phases through staged computation and a separated KV cache. Second, xGR enables early sorting termination and mask-based item filtering with data structure reuse. Third, xGR reconstructs the overall pipeline to exploit multi-level overlap and multi-stream parallelism. Experiments on real-world datasets demonstrate that xGR achieves at least 2.89x the throughput of the state-of-the-art baseline under strict latency constraints.
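
To make the beam-search cost described in the abstract concrete, the following is a minimal illustrative sketch of a single beam-search step with mask-based item filtering. It is not the xGR implementation: the function name `beam_step`, its parameters, and the toy data are all assumptions made for illustration, and the partial top-k selection via `heapq.nlargest` is only a simple stand-in for the idea of avoiding a full sort, not the paper's early-termination algorithm.

```python
import heapq
import math


def beam_step(beams, next_logprobs, valid_mask, beam_width):
    """One beam-search step over a masked item vocabulary (hypothetical helper).

    beams:         list of (score, tokens) pairs for the current beams
    next_logprobs: next_logprobs[i][t] is the log-prob of token t after beam i
    valid_mask:    valid_mask[i][t] is True if token t is allowed after beam i
    """
    # Lazily enumerate only candidates that pass the mask, so invalid
    # items never reach the selection stage.
    candidates = (
        (score + lp, tokens + [t])
        for i, (score, tokens) in enumerate(beams)
        for t, lp in enumerate(next_logprobs[i])
        if valid_mask[i][t]
    )
    # Keep only the best beam_width candidates instead of sorting all of them.
    return heapq.nlargest(beam_width, candidates, key=lambda c: c[0])


# Toy usage: one starting beam, three candidate tokens, token 1 masked out.
beams = [(0.0, [])]
logprobs = [[math.log(0.5), math.log(0.3), math.log(0.2)]]
mask = [[True, False, True]]
result = beam_step(beams, logprobs, mask, beam_width=2)
```

In this toy run, token 1 is filtered out by the mask even though it scores higher than token 2, so the surviving beams end in tokens 0 and 2. The number of candidates per step grows with beam width times vocabulary size, which is why the abstract singles out sorting overhead as a bottleneck.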

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving
cs.DC 2026-05 unverdicted novelty 6.0

HELM adaptively partitions HBM between EMB and KV caches via a three-layer PPO controller and EMB-KV-aware scheduling, reducing P99 latency by 24-38% while achieving 93.5-99.6% SLO satisfaction on production workloads.