pith. the verified trust layer for science. sign in

arxiv: 2511.14881 · v5 · submitted 2025-11-18 · 💻 cs.IR

SilverTorch: A Unified Model-based System to Democratize Large-Scale Recommendation on GPUs

Pith reviewed 2026-05-17 20:18 UTC · model grok-4.3

classification 💻 cs.IR
keywords recommendation servingGPU accelerationmodel unificationBloom indexANN searchdeep learning modelsthroughput scalingcost efficiency
0
0 comments X p. Extension

The pith

Unifying recommendation serving into a single GPU model replaces separate CPU indexing services and enables more complex architectures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a system that integrates feature filtering and nearest-neighbor search directly into the recommendation model running on GPUs. It replaces standalone indexing and filtering services with model layers, using a model-based GPU Bloom index for filtering and a fused integer kernel for search. The co-design reduces memory use and computation overhead. This unification supports an additional scoring layer and multi-task retrieval that improves accuracy while allowing more complex model designs. A sympathetic reader would care because the approach makes it practical to run advanced recommendation models at industry scale without dedicated extra hardware.

Core claim

SilverTorch brings all components of deep learning recommendation model serving into one unified model by replacing standalone indexing and filtering services with model layers. It proposes a model-based GPU Bloom index for feature filtering and a fused Int8 ANN kernel for nearest neighbor search. Through co-design of the ANN search and feature filtering, it reduces GPU memory usage and eliminates computation. This enables an OverArch scoring layer and multi-task retrieval with a Value Model to aggregate scores, improving retrieval accuracy and supporting more complex models.

What carries the argument

The model-based GPU Bloom index paired with the fused Int8 ANN kernel, which together move filtering and search inside the model to cut memory and compute overhead.

If this is right

  • Throughput reaches up to 23.7 times higher than prior approaches.
  • Cost efficiency improves by 13.35 times relative to CPU-based serving.
  • More complex models that use learned similarities and multi-task retrieval become feasible while accuracy rises.
  • The design supports online serving of hundreds of models for varied applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same unification pattern could be tested on other large-scale similarity tasks outside recommendation, such as content retrieval in media platforms.
  • Developers might explore adding further model components like learned filters without rebuilding separate infrastructure layers.
  • Production teams could measure end-to-end latency gains when the same GPU resources handle both retrieval and scoring in one pass.

Load-bearing premise

The integrated Bloom index and ANN kernel preserve or improve retrieval accuracy on diverse real-world queries without introducing systematic misses or biases.

What would settle it

Running the system on a production dataset with query patterns different from the evaluated ones and measuring whether recall or ranking quality drops below the level achieved by separate CPU indexing services.

Figures

Figures reproduced from arXiv: 2511.14881 by Bi Xue, Chao Yang, Eric Dong, Fei Ding, Harry Tran, Hong Wu, Hongzhang Yin, Jiazhou Wang, Ke Huang, Keke Zhai, Kevin Greer, Lei Chen, Liang Wang, Min Ni, Min Yu, Pawel Garbacki, Peng Xia, Qianqian Zhang, Rui Jian, Rui Zhang, Ryan Chang, Xialu Li, Xiaoheng Mao, Yang Liu, Yanli Zhao, Yanzun Huang, Yijie Deng, Yiming Ma, Yiyi Pan, Zheng Fang, Zhen Wang.

Figure 1
Figure 1. Figure 1: Motivation Examples: Figure 1(a) shows the service-based recommendation serving. Figure 1(b) shows the versioning [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overall workflow of SilverTorch model publish and serving and Pseudocode of SilverTorch model definition. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Bloom Index Design for Feature Filtering [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: An example for co-designed ANN with Filtering. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: End-to-end performance Results for Retrieval and ESR. [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Latency/Recall results of different ANN methods. [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Latency/recall on open dataset. Task Method Recall@20 Recall@100 Recall@200 Recall@500 Recall@1000 QPS E-Task Baseline 0.08239 0.19179 0.29131 0.4295 0.44127 51 SilverTorch 0.07163 0.20306 0.28923 0.4237 0.44651 1210 SilverTorch-OverArch 0.09181 (+28.2%) 0.24189 (+19.1%) 0.33148 (+14.6%) 0.44758 (+5.6%) 0.45727 (+2.4%) 771 C-Task Baseline 0.09642 0.25217 0.3551 0.4971 0.5162 51 SilverTorch 0.09652 0.25291 … view at source ↗
Figure 9
Figure 9. Figure 9: Recall/Probes on open dataset. for ANN and 18.2MB for bloom index search). The ANN and filter￾ing co-designed index optimization reduces the temporary memory needed for bloom index to 0.14MB so memory needed is 18.2MB for the ANN search and filtering. Meanwhile, corresponding latency reduces from 1.55ms to 0.72ms. On average, the latency of executing 13 [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Performance of Bloom Index [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Latency and memory utilization comparison be [PITH_FULL_IMAGE:figures/full_fig_p014_11.png] view at source ↗
read the original abstract

Serving deep learning based recommendation models (DLRM) at scale is challenging. Existing approaches rely on dedicated ANN indexing and filtering services on CPUs, suffering from non-negligible costs and missing co-design opportunities. Such inefficiency makes them difficult to support complex model architectures, such as learned similarities and multi-task retrieval. In this paper, we present SilverTorch, a model-based serving system that brings all components into one unified model. It unifies model serving by replacing standalone indexing and filtering services with model layers. We propose a model-based GPU Bloom index for feature filtering and a fused Int8 ANN kernel for nearest neighbor search. Through co-design of the ANN search and feature filtering, we reduce GPU memory usage and eliminate computation. Benefiting from this design, we scale up retrieval by introducing an OverArch scoring layer and a multi-task retrieval with a Value Model to aggregate scores. These advancements improve the retrieval accuracy and enable future studies for serving more complex models. Our evaluation on industry-scale datasets show that SilverTorch achieves up to 23.7\times higher throughput compared to the state-of-the-art approaches. We also demonstrate that SilverTorch solution is 13.35\times more cost-efficient than CPU-based solution while improving accuracy via serving more complex models. SilverTorch is deployed at scale, serving hundreds of models online and supporting recommendation for diverse applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. SilverTorch is a unified model-based serving system for deep learning recommendation models (DLRM) that integrates indexing, filtering, and scoring into GPU model layers. It replaces standalone CPU-based ANN and filtering services with a model-based GPU Bloom index for feature filtering and a fused Int8 ANN kernel for nearest-neighbor search. The design includes an OverArch scoring layer and multi-task retrieval via a Value Model to support complex architectures such as learned similarities. On industry-scale datasets the system reports up to 23.7× higher throughput than state-of-the-art approaches, 13.35× better cost efficiency than CPU baselines, and accuracy gains, with production deployment serving hundreds of models.

Significance. If the throughput, cost, and accuracy claims are substantiated with complete experimental detail, the work has substantial practical significance for large-scale recommendation infrastructure. Unifying indexing/filtering into model layers removes separate services and enables more expressive retrieval models on GPUs, which could lower operational costs while improving quality. The reported deployment at scale and the focus on co-design of quantization, Bloom filtering, and ANN kernels constitute concrete engineering contributions that address real production bottlenecks.

major comments (3)
  1. [Evaluation] Evaluation section: the central throughput claim of up to 23.7× improvement is presented without named baselines, hardware configurations, batch sizes, or error bars. Because the speedup is the primary empirical support for the unified model-based approach, the absence of these details prevents assessment of whether the gains are attributable to the Bloom index + fused Int8 kernel co-design or to other factors.
  2. [Model Architecture / Evaluation] Model description and evaluation: the claim that the model-based GPU Bloom index together with the fused Int8 ANN kernel preserves or improves retrieval accuracy without systematic misses is load-bearing for the accuracy-improvement argument, yet no recall@K, bias, or query-distribution analysis is reported. This directly affects the weakest assumption identified in the review.
  3. [§4] §4 (or equivalent): ablation studies isolating the contributions of the OverArch scoring layer and the Value Model for multi-task retrieval are missing. Without them it is impossible to determine whether the reported accuracy gains stem from these new components or from the underlying ANN and filtering changes.
minor comments (2)
  1. [Abstract] Abstract: the industry-scale datasets used for the reported numbers should be named (even if only by size or domain) to allow readers to gauge representativeness.
  2. [System Design] A diagram or pseudocode equation showing how the Bloom index and Int8 ANN kernel are fused inside the model forward pass would clarify the claimed elimination of computation and memory reduction.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment point by point below and will revise the paper to incorporate additional details and analyses where appropriate.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the central throughput claim of up to 23.7× improvement is presented without named baselines, hardware configurations, batch sizes, or error bars. Because the speedup is the primary empirical support for the unified model-based approach, the absence of these details prevents assessment of whether the gains are attributable to the Bloom index + fused Int8 kernel co-design or to other factors.

    Authors: We agree that these experimental details are necessary for full assessment and reproducibility. In the revised manuscript we will explicitly name the baselines (including the specific ANN libraries and CPU-based systems compared against), specify the hardware configurations (GPU models and counts), report the batch sizes used for each throughput measurement, and add error bars from repeated runs. These additions will clarify that the reported gains derive from the co-design of the model-based Bloom index and fused Int8 ANN kernel rather than other factors. revision: yes

  2. Referee: [Model Architecture / Evaluation] Model description and evaluation: the claim that the model-based GPU Bloom index together with the fused Int8 ANN kernel preserves or improves retrieval accuracy without systematic misses is load-bearing for the accuracy-improvement argument, yet no recall@K, bias, or query-distribution analysis is reported. This directly affects the weakest assumption identified in the review.

    Authors: We acknowledge that additional quantitative support for the accuracy claims would strengthen the argument. While the current manuscript reports accuracy improvements enabled by serving more complex models, we will add recall@K metrics, bias analysis, and query-distribution studies in the revision to demonstrate that the Bloom index and fused kernel preserve retrieval quality without introducing systematic misses. revision: yes

  3. Referee: [§4] §4 (or equivalent): ablation studies isolating the contributions of the OverArch scoring layer and the Value Model for multi-task retrieval are missing. Without them it is impossible to determine whether the reported accuracy gains stem from these new components or from the underlying ANN and filtering changes.

    Authors: We agree that isolating the contributions of the OverArch scoring layer and Value Model is important. In the revised version we will add ablation studies (in §4 or a new subsection) that separately measure the impact of these components on accuracy, holding the underlying ANN and filtering fixed, to clarify their role in the observed gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

This is an empirical systems paper whose central claims rest on measured throughput, cost-efficiency, and accuracy improvements from a GPU-based unified serving architecture. The abstract and description present engineering co-design choices (model-based GPU Bloom index, fused Int8 ANN kernel, OverArch scoring layer) validated through experiments on industry-scale datasets rather than any mathematical derivation chain, first-principles predictions, or fitted parameters that reduce to the paper's own inputs by construction. No self-definitional steps, load-bearing self-citations, or ansatz smuggling appear; the results are externally falsifiable via replication of the reported benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 2 invented entities

The central claims rest on hardware assumptions about GPU memory bandwidth and on engineering choices for index and kernel parameters that are tuned rather than derived.

free parameters (2)
  • Bloom filter size and hash functions
    Chosen to balance memory and false-positive rate for feature filtering.
  • Int8 quantization parameters for ANN
    Tuned for speed-accuracy trade-off in the fused kernel.
axioms (1)
  • domain assumption GPU hardware supplies sufficient memory bandwidth and compute to make the fused kernels faster than CPU services
    Invoked to explain the reported throughput and cost gains.
invented entities (2)
  • model-based GPU Bloom index no independent evidence
    purpose: Replace standalone feature filtering service inside the model
    New component proposed to enable unified serving.
  • OverArch scoring layer no independent evidence
    purpose: Scale retrieval by aggregating multi-task scores
    Introduced to support more complex model architectures.

pith-pipeline@v0.9.0 · 5644 in / 1412 out tokens · 62735 ms · 2026-05-17T20:18:16.586044+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 3 internal anchors

  1. [1]

    14 {TensorFlow}: a system for {Large-Scale} machine learning

    Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 14 {TensorFlow}: a system for {Large-Scale} machine learning. In12th USENIX symposium on operating systems design and implementation (OSDI 16), pages 265–283, 2016

  2. [2]

    Under- standing scaling laws for recommendation models

    Newsha Ardalani, Carole-Jean Wu, Zeliang Chen, Bhargav Bhushanam, and Adnan Aziz. Understanding scaling laws for recommendation models.arXiv preprint arXiv:2208.08489, 2022

  3. [3]

    k-means++: The advantages of careful seeding

    David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. Technical report, Stanford, 2006

  4. [4]

    Aws p4d.24xlarge instance cost

    aws. Aws p4d.24xlarge instance cost. https://instances.vantage.sh/aws/ec2/p4d. 24xlarge?region=us-west-2, 2023

  5. [5]

    Aws r6i.8xlarge instance cost

    aws. Aws r6i.8xlarge instance cost. https://instances.vantage.sh/aws/ec2/r6i. 8xlarge?region=us-west-1, 2023

  6. [6]

    Aws x2idn.24xlarge instance cost

    aws. Aws x2idn.24xlarge instance cost. https://instances.vantage.sh/aws/ec2/ x2idn.24xlarge?region=us-west-1, 2023

  7. [7]

    Itemsage: Learning product embeddings for shopping rec- ommendations at pinterest

    Paul Baltescu, Haoyu Chen, Nikil Pancha, Andrew Zhai, Jure Leskovec, and Charles Rosenberg. Itemsage: Learning product embeddings for shopping rec- ommendations at pinterest. InProceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2703–2711, 2022

  8. [8]

    Linr: Model based neural retrieval on gpus at linkedin

    Fedor Borisyuk, Qingquan Song, Mingzhou Zhou, Ganesh Parameswaran, Madhu Arun, Siva Popuri, Tugrul Bingol, Zhuotao Pei, Kuang-Hsuan Lee, Lu Zheng, et al. Linr: Model based neural retrieval on gpus at linkedin. InProceedings of the 33rd ACM International Conference on Information and Knowledge Management, pages 4366–4373, 2024

  9. [9]

    The anatomy of a large-scale hypertextual web search engine.Computer networks and ISDN systems, 30(1-7):107–117, 1998

    Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine.Computer networks and ISDN systems, 30(1-7):107–117, 1998

  10. [10]

    Scalability and efficiency chal- lenges in large-scale web search engines

    B Barla Cambazoglu and Ricardo Baeza-Yates. Scalability and efficiency chal- lenges in large-scale web search engines. InProceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 1223–1226, 2016

  11. [11]

    Spann: Highly-efficient billion-scale approximate nearest neighborhood search.Advances in Neural Information Processing Systems, 34:5199–5212, 2021

    Qi Chen, Bing Zhao, Haidong Wang, Mingqin Li, Chuanjie Liu, Zengzhong Li, Mao Yang, and Jingdong Wang. Spann: Highly-efficient billion-scale approximate nearest neighborhood search.Advances in Neural Information Processing Systems, 34:5199–5212, 2021

  12. [12]

    Deep neural networks for youtube recommendations

    Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for youtube recommendations. InProceedings of the 10th ACM conference on recommender systems, pages 191–198, 2016

  13. [13]

    Retrieval with learned similarities

    Bailu Ding and Jiaqi Zhai. Retrieval with learned similarities. InProceedings of the ACM on Web Conference 2025, pages 1626–1637, 2025

  14. [14]

    The faiss library

    Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The faiss library. 2024

  15. [15]

    Compiling machine learning programs via high-level tracing

    Roy Frostig, Matthew James Johnson, and Chris Leary. Compiling machine learning programs via high-level tracing. InSysML conference 2018, 2019

  16. [16]

    Faiss on the gpu limitations

    github. Faiss on the gpu limitations. https://github.com/facebookresearch/faiss/ wiki/Faiss-on-the-GPU#limitations, 2023

  17. [17]

    Bitfunnel: Revisiting signatures for search

    Bob Goodwin, Michael Hopcroft, Dan Luu, Alex Clemmer, Mihaela Curmei, Sameh Elnikety, and Yuxiong He. Bitfunnel: Revisiting signatures for search. InProceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 605–614, 2017

  18. [18]

    Embedding-based retrieval in facebook search

    Jui-Ting Huang, Ashish Sharma, Shuying Sun, Li Xia, David Zhang, Philip Pronin, Janani Padmanabhan, Giuseppe Ottaviano, and Linjun Yang. Embedding-based retrieval in facebook search. InProceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2553–2561, 2020

  19. [19]

    Torchrec: a pytorch domain library for recommendation systems

    Dmytro Ivchenko, Dennis Van Der Staay, Colin Taylor, Xing Liu, Will Feng, Rahul Kindi, Anirudh Sudarshan, and Shahin Sefati. Torchrec: a pytorch domain library for recommendation systems. InProceedings of the 16th ACM Conference on Recommender Systems, pages 482–483, 2022

  20. [20]

    Diskann: Fast accurate billion-point nearest neighbor search on a single node.Advances in neural information processing Systems, 32, 2019

    Suhas Jayaram Subramanya, Fnu Devvrit, Harsha Vardhan Simhadri, Ravishankar Krishnawamy, and Rohan Kadekodi. Diskann: Fast accurate billion-point nearest neighbor search on a single node.Advances in neural information processing Systems, 32, 2019

  21. [21]

    Billion-scale similarity search with GPUs.IEEE Transactions on Big Data, 7(3):535–547, 2019

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs.IEEE Transactions on Big Data, 7(3):535–547, 2019

  22. [22]

    Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

  23. [23]

    Yu A Malkov and Dmitry A Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs.IEEE transac- tions on pattern analysis and machine intelligence, 42(4):824–836, 2018

  24. [24]

    Milvus gpu limitations

    Milvus. Milvus gpu limitations. https://milvus.io/docs/gpu_index.md, 2023

  25. [25]

    Ray: A distributed framework for emerging {AI} applications

    Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I Jordan, et al. Ray: A distributed framework for emerging {AI} applications. In13th USENIX symposium on operating systems design and implementation (OSDI 18), pages 561–577, 2018

  26. [26]

    Software-hardware co-design for fast and scalable training of deep learning recommendation models

    Dheevatsa Mudigere, Yuchen Hao, Jianyu Huang, Zhihao Jia, Andrew Tulloch, Srinivas Sridharan, Xing Liu, Mustafa Ozdal, Jade Nie, Jongsoo Park, et al. Software-hardware co-design for fast and scalable training of deep learning recommendation models. InProceedings of the 49th Annual International Sympo- sium on Computer Architecture, pages 993–1011, 2022

  27. [27]

    Deep Learning Recommendation Model for Personalization and Recommendation Systems

    Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole- Jean Wu, Alisson G Azzolini, et al. Deep learning recommendation model for personalization and recommendation systems.arXiv preprint arXiv:1906.00091, 2019

  28. [28]

    Merlin: a gpu accelerated recommendation framework

    Even Oldridge, Julio Perez, Ben Frederickson, Nicolas Koumchatzky, Minseok Lee, Zehuan Wang, Lei Wu, Fan Yu, Rick Zamora, Onur Yilmaz, et al. Merlin: a gpu accelerated recommendation framework. InProceedings of IRS, 2020

  29. [29]

    Lance: Efficient random access in columnar storage through adaptive structural encodings.arXiv preprint arXiv:2504.15247, 2025

    Weston Pace, Chang She, Lei Xu, Will Jones, Albert Lockett, Jun Wang, and Raunak Shah. Lance: Efficient random access in columnar storage through adaptive structural encodings.arXiv preprint arXiv:2504.15247, 2025

  30. [30]

    PyTorch: An Imperative Style, High-Performance Deep Learning Library

    A Paszke. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019

  31. [31]

    Manas hnsw realtime: Powering realtime embedding-based re- trieval

    Pinterest. Manas hnsw realtime: Powering realtime embedding-based re- trieval. https://medium.com/pinterest-engineering/manas-hnsw-realtime- powering-realtime-embedding-based-retrieval-dc71dfd6afdd, 2023

  32. [32]

    Rapidsai/raft: Raft contains fundamental widely-used algorithms and primitives for data science, graph and machine learning., 2022

    Rapidsai. Rapidsai/raft: Raft contains fundamental widely-used algorithms and primitives for data science, graph and machine learning., 2022. URL https: //github.com/rapidsai/raft

  33. [33]

    Milvus: A purpose-built vector data management system

    Jianguo Wang, Xiaomeng Yi, Rentong Guo, Hai Jin, Peng Xu, Shengjun Li, Xi- angyu Wang, Xiangzhou Guo, Chengming Li, Xiaohai Xu, et al. Milvus: A purpose-built vector data management system. InProceedings of the 2021 Inter- national Conference on Management of Data, pages 2614–2627, 2021

  34. [34]

    Billion-scale commodity embedding for e-commerce recommendation in alibaba

    Jizhe Wang, Pipei Huang, Huan Zhao, Zhibo Zhang, Binqiang Zhao, and Dik Lun Lee. Billion-scale commodity embedding for e-commerce recommendation in alibaba. InProceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pages 839–848, 2018

  35. [35]

    Sampling-bias-corrected neural modeling for large corpus item recommendations

    Xinyang Yi, Ji Yang, Lichan Hong, Derek Zhiyuan Cheng, Lukasz Heldt, Aditee Kumthekar, Zhe Zhao, Li Wei, and Ed Chi. Sampling-bias-corrected neural modeling for large corpus item recommendations. InProceedings of the 13th ACM conference on recommender systems, pages 269–277, 2019

  36. [36]

    Learning a unified embedding for visual search at pinterest

    Andrew Zhai, Hao-Yu Wu, Eric Tzeng, Dong Huk Park, and Charles Rosenberg. Learning a unified embedding for visual search at pinterest. InProceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 2412–2420, 2019

  37. [37]

    Revisiting neural retrieval on accelerators

    Jiaqi Zhai, Zhaojie Gong, Yueming Wang, Xiao Sun, Zheng Yan, Fu Li, and Xing Liu. Revisiting neural retrieval on accelerators. InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5520–5531, 2023

  38. [38]

    Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations

    Jiaqi Zhai, Lucy Liao, Xing Liu, Yueming Wang, Rui Li, Xuan Cao, Leon Gao, Zhao- jie Gong, Fangda Gu, Michael He, et al. Actions speak louder than words: Trillion- parameter sequential transducers for generative recommendations.arXiv preprint arXiv:2402.17152, 2024

  39. [39]

    Wukong: Towards a scaling law for large-scale recommendation.arXiv preprint arXiv:2403.02545, 2024

    Buyun Zhang, Liang Luo, Yuxin Chen, Jade Nie, Xi Liu, Daifeng Guo, Yanli Zhao, Shen Li, Yuchen Hao, Yantao Yao, et al. Wukong: Towards a scaling law for large-scale recommendation.arXiv preprint arXiv:2403.02545, 2024

  40. [40]

    Embedding in recommender systems: A survey.arXiv preprint arXiv:2310.18608, 2023

    Xiangyu Zhao, Maolin Wang, Xinjian Zhao, Jiansheng Li, Shucheng Zhou, Dawei Yin, Qing Li, Jiliang Tang, and Ruocheng Guo. Embedding in recommender systems: A survey.arXiv preprint arXiv:2310.18608, 2023. 15