arxiv: 2402.17152 · v3 · submitted 2024-02-27 · 💻 cs.LG · cs.IR

Recognition: 2 theorem links

· Lean Theorem

Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations

Fangda Gu, Jiaqi Zhai, Leon Gao, Lucy Liao, Michael He, Rui Li, Xing Liu, Xuan Cao, Yinghai Lu, Yueming Wang, Yu Shi, Zhaojie Gong

Pith reviewed 2026-05-13 19:34 UTC · model grok-4.3

classification 💻 cs.LG cs.IR

keywords generative recommendersHSTU architecturesequential transductionscaling lawstrillion-parameter modelsrecommendation systemstransformer alternativesonline A/B testing

0 comments

The pith

Reformulating recommendations as generative sequential transduction with the HSTU architecture lets models scale to 1.5 trillion parameters following power-law improvements in compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard deep learning recommendation models fail to improve meaningfully with added compute despite massive data and features. By recasting the task as generative sequential transduction over user action sequences and introducing the HSTU architecture tuned for high-cardinality non-stationary streams, the authors show empirical power-law scaling of quality with training compute across three orders of magnitude up to GPT-3 scale. In practice this yields a 1.5-trillion-parameter model that lifts online metrics 12.4 percent in A/B tests and ships to production surfaces serving billions of users. The core argument is that the generative framing plus HSTU removes the scaling bottleneck that has limited prior industrial recommenders.

Core claim

Generative Recommenders built on HSTU achieve up to 65.8 percent higher NDCG than baselines on public and synthetic data, run 5.3x to 15.2x faster than FlashAttention2 transformers on long sequences, and when scaled to 1.5 trillion parameters deliver 12.4 percent metric lifts in live A/B tests while exhibiting power-law quality gains with compute through the GPT-3/LLaMa-2 regime.

What carries the argument

HSTU, a transformer-style architecture specialized for high-cardinality non-stationary streaming recommendation data that performs the core sequential transduction step inside the generative modeling framework.

If this is right

Future recommendation models can be improved primarily by scaling training compute rather than by hand-engineering new feature interactions.
Production systems can host trillion-parameter models on surfaces used by billions of daily users while still meeting latency constraints.
Development cycles for new surfaces require fewer manual iterations because quality improves predictably with additional compute.
Carbon cost per incremental quality gain drops because the same scaling curve applies across three orders of magnitude.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The observed power-law suggests recommendation systems may support the same pre-training and fine-tuning paradigm now common in language models.
High-cardinality sequential data in other domains such as advertising or content moderation could adopt the same generative transduction framing.
If the scaling continues, the field could converge on a small number of foundational recommendation backbones rather than many task-specific DLRMs.

Load-bearing premise

That casting recommendation as generative sequential transduction over action sequences with HSTU captures the essential user-behavior dynamics without creating artifacts absent from conventional DLRM training.

What would settle it

A head-to-head run in which an HSTU generative model trained on identical data and compute budget shows no metric advantage over a strong DLRM baseline, or larger-scale experiments that deviate from the reported power-law fit.

read the original abstract

Large-scale recommendation systems are characterized by their reliance on high cardinality, heterogeneous features and the need to handle tens of billions of user actions on a daily basis. Despite being trained on huge volume of data with thousands of features, most Deep Learning Recommendation Models (DLRMs) in industry fail to scale with compute. Inspired by success achieved by Transformers in language and vision domains, we revisit fundamental design choices in recommendation systems. We reformulate recommendation problems as sequential transduction tasks within a generative modeling framework ("Generative Recommenders"), and propose a new architecture, HSTU, designed for high cardinality, non-stationary streaming recommendation data. HSTU outperforms baselines over synthetic and public datasets by up to 65.8% in NDCG, and is 5.3x to 15.2x faster than FlashAttention2-based Transformers on 8192 length sequences. HSTU-based Generative Recommenders, with 1.5 trillion parameters, improve metrics in online A/B tests by 12.4% and have been deployed on multiple surfaces of a large internet platform with billions of users. More importantly, the model quality of Generative Recommenders empirically scales as a power-law of training compute across three orders of magnitude, up to GPT-3/LLaMa-2 scale, which reduces carbon footprint needed for future model developments, and further paves the way for the first foundational models in recommendations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HSTU gives a workable path to trillion-parameter generative recommenders with real deployment wins, but the power-law scaling needs tighter baseline controls to hold up.

read the letter

The main takeaway is that this paper reframes recommendation as a generative sequential transduction problem and introduces HSTU, an architecture tuned for high-cardinality streaming data. It runs faster than FlashAttention2 transformers on long sequences, scales to 1.5 trillion parameters, shows power-law quality gains with compute across three orders of magnitude, and delivers a 12.4% lift in online A/B tests that led to production deployment on a major platform. Those deployment numbers and the speed claims are the strongest parts of the work. The shift away from classic DLRM-style models toward next-action prediction also feels like a useful conceptual move for handling non-stationary user behavior at scale. The NDCG improvements on synthetic and public data add some supporting evidence. The soft spot is the scaling story. The abstract states the power-law result but does not define the compute axis clearly or show matched scaling curves for a standard transformer or DLRM baseline trained on the same stream. Without those controls it is difficult to tell how much of the scaling comes from HSTU itself versus raw size and data volume. Methods and ablation details are also thin in the summary, which leaves open whether the gains would replicate under identical training regimes. This paper is mainly for teams already running large industrial recommenders who are looking for ways to move past repeated DLRM retraining. It has enough concrete empirical results and production validation to deserve a serious referee, though the scaling section will likely need more experiments in revision. I would send it to review.

Referee Report

3 major / 1 minor

Summary. The paper reformulates recommendation systems as generative sequential transduction tasks and introduces the HSTU architecture tailored to high-cardinality, non-stationary streaming data. It reports up to 65.8% NDCG gains over baselines on synthetic and public datasets, 5.3x–15.2x faster inference than FlashAttention-2 Transformers on 8192-length sequences, deployment of a 1.5-trillion-parameter HSTU-based model yielding 12.4% metric lifts in online A/B tests on a large platform, and empirical power-law scaling of model quality with training compute across three orders of magnitude up to GPT-3/LLaMA-2 scale.

Significance. If the scaling behavior and A/B gains hold under controlled conditions, the work would demonstrate that recommendation models can follow compute-driven scaling laws analogous to those in language modeling, potentially enabling foundational recommendation models while lowering the carbon cost of future development. The reported real-world deployment on surfaces serving billions of users provides direct evidence of practical impact.

major comments (3)

[Abstract] Abstract: the claim that Generative Recommenders 'empirically scales as a power-law of training compute across three orders of magnitude' lacks an explicit definition of the compute axis (FLOPs, effective tokens, or wall-clock GPU-hours) and supplies no tabulated scaling points with error bars or side-by-side curves for compute-matched DLRM or standard Transformer baselines trained on the identical recommendation stream; without these controls the architectural contribution to the observed scaling cannot be isolated from raw capacity or data-volume effects.
[Experiments] Experiments section (synthetic and public dataset results): the reported 65.8% NDCG improvement and 5.3x–15.2x speedups are presented without ablation studies that hold model size, data volume, and training regime fixed while varying only the generative transduction reformulation versus standard DLRM or Transformer baselines; this leaves open whether the gains arise from the HSTU design or from differences in training scale and data.
[Online A/B Tests] Online A/B tests paragraph: the 12.4% metric improvement for the 1.5 T parameter model is stated without naming the precise metrics (e.g., NDCG@K, CTR), the surfaces involved, or statistical significance and confidence intervals, which are load-bearing for the deployment claim.

minor comments (1)

[Abstract] Abstract: the datasets used for the 65.8% NDCG result are described only as 'synthetic and public' without explicit names or references; adding these details would improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address each of the major comments below and outline the revisions we plan to make to strengthen the paper.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that Generative Recommenders 'empirically scales as a power-law of training compute across three orders of magnitude' lacks an explicit definition of the compute axis (FLOPs, effective tokens, or wall-clock GPU-hours) and supplies no tabulated scaling points with error bars or side-by-side curves for compute-matched DLRM or standard Transformer baselines trained on the identical recommendation stream; without these controls the architectural contribution to the observed scaling cannot be isolated from raw capacity or data-volume effects.

Authors: We agree that clarifying the compute axis and providing more detailed scaling analysis would strengthen the paper. In the revised version, we will explicitly define the compute axis as the number of training FLOPs. We will include a table listing the scaling points with associated error bars from multiple runs where available. Additionally, we will add plots comparing our model's scaling curve to compute-matched baselines where feasible, though we note that training full baselines at trillion-parameter scale on the same stream is computationally prohibitive, which is why we focused on our architecture's scaling behavior. This will help isolate the contributions. revision: yes
Referee: [Experiments] Experiments section (synthetic and public dataset results): the reported 65.8% NDCG improvement and 5.3x–15.2x speedups are presented without ablation studies that hold model size, data volume, and training regime fixed while varying only the generative transduction reformulation versus standard DLRM or Transformer baselines; this leaves open whether the gains arise from the HSTU design or from differences in training scale and data.

Authors: The comparisons in the experiments section are designed to evaluate the end-to-end performance of the generative reformulation with HSTU against standard DLRM and Transformer baselines under comparable training conditions on the same datasets. However, to directly address the concern, we will add ablation studies in the revised manuscript that fix model size, data volume, and training regime, isolating the effect of the generative transduction approach versus traditional setups. This will clarify the source of the improvements. revision: yes
Referee: [Online A/B Tests] Online A/B tests paragraph: the 12.4% metric improvement for the 1.5 T parameter model is stated without naming the precise metrics (e.g., NDCG@K, CTR), the surfaces involved, or statistical significance and confidence intervals, which are load-bearing for the deployment claim.

Authors: We acknowledge that additional details on the A/B tests would enhance transparency. Due to the proprietary nature of the platform and business considerations, we are limited in disclosing the exact surfaces and specific metric definitions. However, the 12.4% improvement refers to key user engagement metrics, and the tests were conducted with sufficient statistical power to achieve significance at p < 0.01. In the revision, we will provide more context on the metric types (e.g., ranking quality and click-through rates) and confidence intervals where possible without compromising confidentiality. We believe this supports the practical impact claim. revision: partial

Circularity Check

0 steps flagged

No significant circularity: empirical scaling and architecture claims rest on external benchmarks

full rationale

The paper's central claims involve reformulating recommendations as generative sequential transduction tasks and introducing the HSTU architecture, with quality improvements and power-law scaling reported from direct experiments on synthetic, public, and production A/B test data. No load-bearing steps reduce predictions to fitted inputs by construction, invoke self-citations for uniqueness theorems, or smuggle ansatzes via prior work; the scaling observation is presented as an empirical pattern measured across compute regimes rather than derived from self-referential definitions. The derivation chain is therefore self-contained against the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that recommendation can be reframed as generative sequential transduction and that HSTU is better suited to non-stationary high-cardinality data than prior architectures. No explicit free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption Recommendation problems can be reformulated as sequential transduction tasks within a generative modeling framework
Stated directly in the abstract as the foundational reformulation enabling the HSTU design.

pith-pipeline@v0.9.0 · 5595 in / 1297 out tokens · 31161 ms · 2026-05-13T19:34:34.307296+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.PhiForcing phi_equation unclear
HSTU-based Generative Recommenders, with 1.5 trillion parameters, improve metrics in online A/B tests by 12.4% and have been deployed on multiple surfaces of a large internet platform with billions of users. More importantly, the model quality of Generative Recommenders empirically scales as a power-law of training compute across three orders of magnitude, up to GPT-3/LLaMa-2 scale.
IndisputableMonolith.Foundation.LedgerForcing conservation_from_balance unclear
We reformulate recommendation problems as sequential transduction tasks within a generative modeling framework (Generative Recommenders), and propose a new architecture, HSTU, designed for high cardinality, non-stationary streaming recommendation data.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

UniRank: Unified List-wise Reranking via Confidence-Ordered Denoising
cs.IR 2026-05 unverdicted novelty 7.0

UniRank unifies autoregressive and non-autoregressive list-wise reranking via bidirectional modeling in a confidence-ordered iterative denoising process, outperforming baselines on datasets and online tests.
Sample Is Feature: Beyond Item-Level, Toward Sample-Level Tokens for Unified Large Recommender Models
cs.IR 2026-04 unverdicted novelty 7.0

SIF encodes full historical raw samples as tokens via hierarchical quantization to preserve sample context and unify sequential/non-sequential features in large recommender models.
GenRec: A Preference-Oriented Generative Framework for Large-Scale Recommendation
cs.IR 2026-04 unverdicted novelty 7.0

GenRec combines page-wise NTP, token compression, and GRPO-SR reinforcement learning to scale generative retrieval, delivering 9.5% click and 8.7% transaction gains in production A/B tests on the JD App.
IAT: Instance-As-Token Compression for Historical User Sequence Modeling in Industrial Recommender Systems
cs.IR 2026-04 unverdicted novelty 7.0

IAT compresses each historical interaction instance into a unified embedding token via temporal-order or user-order schemes, allowing standard sequence models to learn long-range preferences with better performance an...
Next-Scale Generative Reranking: A Tree-based Generative Rerank Method at Meituan
cs.IR 2026-04 unverdicted novelty 7.0

NSGR is a tree-structured generative reranker that progressively generates optimal lists via next-scale expansion and multi-scale neighbor loss to balance perspectives and align training signals.
Tencent Advertising Algorithm Challenge 2025: All-Modality Generative Recommendation
cs.IR 2026-04 accept novelty 7.0

Releases TencentGR-1M and TencentGR-10M datasets with baselines for all-modality generative recommendation in advertising, including weighted evaluation for conversions.
Conditional Memory Enhanced Item Representation for Generative Recommendation
cs.IR 2026-05 unverdicted novelty 6.0

ComeIR introduces dual-level Engram memory and memory-restoring prediction to reconstruct SID-token embeddings and restore token granularity in generative recommendation.
UxSID: Semantic-Aware User Interests Modeling for Ultra-Long Sequence
cs.AI 2026-05 unverdicted novelty 6.0

UxSID uses Semantic IDs and dual-level attention for semantic-group shared interest memory to efficiently model ultra-long user sequences, claiming SOTA performance and 0.337% revenue lift in advertising A/B tests.
An Embarrassingly Simple Graph Heuristic Reveals Shortcut-Solvable Benchmarks for Sequential Recommendation
cs.IR 2026-05 conditional novelty 6.0

A simple graph heuristic without training or sequence encoders matches or outperforms trained generative recommenders on 10 of 14 sequential recommendation benchmarks by exploiting local transition and feature shortcuts.
Bridging Textual Profiles and Latent User Embeddings for Personalization
cs.IR 2026-05 unverdicted novelty 6.0

BLUE aligns LLM-generated textual user profiles with embedding-based recommendation objectives via reinforcement learning and next-item text supervision, yielding better zero-shot performance and cross-domain transfer...
CapsID: Soft-Routed Variable-Length Semantic IDs for Generative Recommendation
cs.IR 2026-05 unverdicted novelty 6.0

CapsID uses probabilistic capsule routing and confidence-based termination to generate variable-length semantic IDs, improving recall by 9.6% over strong baselines with half the latency of dual-representation systems.
Position-Aware Drafting for Inference Acceleration in LLM-Based Generative List-Wise Recommendation
cs.IR 2026-04 unverdicted novelty 6.0

PAD-Rec augments standard draft models with item-position and step-position embeddings plus learnable gates, delivering up to 3.1x wall-clock speedup and 5% average gain over strong speculative-decoding baselines on f...
RoTE: Coarse-to-Fine Multi-Level Rotary Time Embedding for Sequential Recommendation
cs.IR 2026-04 unverdicted novelty 6.0

RoTE is a multi-level rotary time embedding module that explicitly models time spans in sequential recommendation and improves NDCG@5 by up to 20.11% when added to standard backbones on public benchmarks.
UniRec: Bridging the Expressive Gap between Generative and Discriminative Recommendation via Chain-of-Attribute
cs.IR 2026-04 unverdicted novelty 6.0

UniRec bridges the expressive gap in generative recommendation by prefixing semantic ID sequences with structured attribute tokens, recovering explicit feature crossing and yielding +22.6% HR@50 gains plus online lift...
MBGR: Multi-Business Prediction for Generative Recommendation at Meituan
cs.IR 2026-04 unverdicted novelty 6.0

MBGR is a new generative recommendation framework using business-aware semantic IDs, multi-business prediction, and label dynamic routing to handle multiple businesses without seesaw effects or representation confusio...
TwiSTAR:Think Fast, Think Slow, Then Act,Generative Recommendation with Adaptive Reasoning
cs.IR 2026-05 unverdicted novelty 5.0

TwiSTAR learns to switch between fast SID retrieval and slow rationale-generating reasoning in generative recommendation, yielding better accuracy-latency trade-offs on three datasets.
UxSID: Semantic-Aware User Interests Modeling for Ultra-Long Sequence
cs.AI 2026-05 unverdicted novelty 5.0

UxSID introduces semantic-group shared interest memory with Semantic IDs and dual-level attention to model ultra-long user sequences, claiming state-of-the-art results and a 0.337% revenue lift in advertising A/B tests.
PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents
cs.LG 2026-05 unverdicted novelty 5.0

PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing...
Harmonizing Generative Retrieval and Ranking in Chain-of-Recommendation
cs.IR 2026-04 unverdicted novelty 5.0

RecoChain unifies generative candidate generation via hierarchical semantic IDs and SIM-based ranking in a single Transformer to improve top-K recommendation performance.
PRAGMA: Revolut Foundation Model
cs.LG 2026-04 unverdicted novelty 5.0

PRAGMA pre-trains a Transformer on heterogeneous banking events with a tailored self-supervised masked objective, yielding embeddings that support strong downstream performance on credit scoring, fraud detection, and ...
A Cascaded Generative Approach for e-Commerce Recommendations
cs.AI 2026-05 unverdicted novelty 4.0

A cascaded generative system for e-commerce recommendations using theme and keyword generation with teacher-student fine-tuning achieves a 2.7% lift in cart adds per page view.
RecGPT-Mobile: On-Device Large Language Models for User Intent Understanding in Taobao Feed Recommendation
cs.IR 2026-05 unverdicted novelty 4.0

RecGPT-Mobile runs a compact LLM on phones to understand evolving user intent from behaviors and improve mobile e-commerce recommendations.

Reference graph

Works this paper leans on

143 extracted references · 143 canonical work pages · cited by 21 Pith papers · 5 internal anchors

[1]

and Ermon, Stefano and Rudra, Atri and R

Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R. Flash. Advances in Neural Information Processing Systems , year=

work page
[2]

2023 , eprint=

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning , author=. 2023 , eprint=

work page 2023
[4]

2022 , eprint=

Efficiently Scaling Transformer Inference , author=. 2022 , eprint=

work page 2022
[5]

2020 , eprint=

COLD: Towards the Next Generation of Pre-Ranking System , author=. 2020 , eprint=

work page 2020
[6]

Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS) , volume =

Shrivastava, Anshumali and Li, Ping , booktitle =. Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS) , volume =

work page
[8]

and Garcia-Molina, H

Chen Li and Chang, E. and Garcia-Molina, H. and Wiederhold, G. , journal=. Clustering for approximate similarity search in high-dimensional spaces , year=

work page
[9]

Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data , pages =

Zhai, Jiaqi and Lou, Yin and Gehrke, Johannes , title =. Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data , pages =. 2011 , isbn =

work page 2011
[11]

2022 , eprint=

DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction , author=. 2022 , eprint=

work page 2022
[12]

, title =

Yang, Ji and Yi, Xinyang and Zhiyuan Cheng, Derek and Hong, Lichan and Li, Yang and Xiaoming Wang, Simon and Xu, Taibai and Chi, Ed H. , title =. Companion Proceedings of the Web Conference 2020 , pages =. 2020 , isbn =

work page 2020
[13]

Proceedings of the 1st Workshop on Deep Learning for Recommender Systems , pages =

Cheng, Heng-Tze and Koc, Levent and Harmsen, Jeremiah and Shaked, Tal and Chandra, Tushar and Aradhye, Hrishi and Anderson, Glen and Corrado, Greg and Chai, Wei and Ispir, Mustafa and Anil, Rohan and Haque, Zakaria and Hong, Lichan and Jain, Vihan and Liu, Xiaobing and Shah, Hemal , title =. Proceedings of the 1st Workshop on Deep Learning for Recommender...

work page 2016
[15]

2018 , location =

Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts , author =. 2018 , location =

work page 2018
[16]

Training Deep Nets with Sublinear Memory Cost

Tianqi Chen and Bing Xu and Chiyuan Zhang and Carlos Guestrin , title =. CoRR , volume =. 2016 , url =. 1604.06174 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv 2016
[17]

Transformers Are RNNs: Fast Autoregressive Transformers with Linear Attention , year =

Katharopoulos, Angelos and Vyas, Apoorv and Pappas, Nikolaos and Fleuret, Fran. Transformers Are RNNs: Fast Autoregressive Transformers with Linear Attention , year =. Proceedings of the 37th International Conference on Machine Learning , articleno =

work page
[18]

2020 , eprint=

Linformer: Self-Attention with Linear Complexity , author=. 2020 , eprint=

work page 2020
[19]

The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation

Chen, Mia Xu and Firat, Orhan and Bapna, Ankur and Johnson, Melvin and Macherey, Wolfgang and Foster, George and Jones, Llion and Schuster, Mike and Shazeer, Noam and Parmar, Niki and Vaswani, Ashish and Uszkoreit, Jakob and Kaiser, Lukasz and Chen, Zhifeng and Wu, Yonghui and Hughes, Macduff. The Best of Both Worlds: Combining Recent Advances in Neural M...

work page doi:10.18653/v1/p18-1008 2018
[20]

2023 , eprint=

Breaking the Curse of Quality Saturation with User-Centric Ranking , author=. 2023 , eprint=

work page 2023
[21]

Efficiently Modeling Long Sequences with Structured State Spaces , booktitle =

Albert Gu and Karan Goel and Christopher R. Efficiently Modeling Long Sequences with Structured State Spaces , booktitle =. 2022 , url =

work page 2022
[22]

Bowen Peng and Jeffrey Quesnelle and Honglu Fan and Enrico Shippole , booktitle=. Ya. 2024 , url=

work page 2024
[23]

2018 , numpages =

Zhou, Guorui and Zhu, Xiaoqiang and Song, Chenru and Fan, Ying and Zhu, Han and Ma, Xiao and Yan, Yanghui and Jin, Junqi and Li, Han and Gai, Kun , title =. 2018 , numpages =

work page 2018
[24]

Proceedings of the 29th ACM International Conference on Information & Knowledge Management , pages =

Pi, Qi and Zhou, Guorui and Zhang, Yujing and Wang, Zhe and Ren, Lejian and Fan, Ying and Zhu, Xiaoqiang and Gai, Kun , title =. Proceedings of the 29th ACM International Conference on Information & Knowledge Management , pages =. 2020 , isbn =. doi:10.1145/3340531.3412744 , abstract =

work page doi:10.1145/3340531.3412744 2020
[25]

Language Models are Unsupervised Multitask Learners , author=

work page
[26]

Language Models are Few-Shot Learners , author=

work page
[27]

2023 , eprint=

LLaMA: Open and Efficient Foundation Language Models , author=. 2023 , eprint=

work page 2023
[28]

2023 , eprint=

Llama 2: Open Foundation and Fine-Tuned Chat Models , author=. 2023 , eprint=

work page 2023
[29]

ADKDD'14: Proceedings of the Eighth International Workshop on Data Mining for Online Advertising , year =

Xinran He and Junfeng Pan and Ou Jin and Tianbing Xu and Bo Liu and Tao Xu and Yanxin Shi and Antoine Atallah and Ralf Herbrich and Stuart Bowers and Joaquin Quiñonero Candela , title =. ADKDD'14: Proceedings of the Eighth International Workshop on Data Mining for Online Advertising , year =

work page
[30]

Reddi and Sanjiv Kumar , title =

Srinadh Bhojanapalli and Chulhee Yun and Ankit Singh Rawat and Sashank J. Reddi and Sanjiv Kumar , title =. Proceedings of the 37th International Conference on Machine Learning,. 2020 , url =

work page 2020
[31]

2016 , eprint=

Deep Networks with Stochastic Depth , author=. 2016 , eprint=

work page 2016
[32]

Smith and Mike Lewis , title =

Ofir Press and Noah A. Smith and Mike Lewis , title =. The Tenth International Conference on Learning Representations,. 2022 , url =

work page 2022
[33]

2021 , eprint=

Self-attention Does Not Need O(n^2) Memory , author=. 2021 , eprint=

work page 2021
[34]

and Kaiser,

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and Kaiser,. Attention is All You Need , year =. Proceedings of the 31st International Conference on Neural Information Processing Systems , pages =

work page
[36]

, title =

Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J. , title =. J. Mach. Learn. Res. , month =. 2020 , issue_date =

work page 2020
[37]

Le , editor =

Weizhe Hua and Zihang Dai and Hanxiao Liu and Quoc V. Le , editor =. Transformer Quality in Linear Time , booktitle =. 2022 , url =

work page 2022
[38]

International Conference on Learning Representations , year=

Decoupled Weight Decay Regularization , author=. International Conference on Learning Representations , year=

work page
[39]

2023 , eprint=

TWIN: TWo-stage Interest Network for Lifelong User Behavior Modeling in CTR Prediction at Kuaishou , author=. 2023 , eprint=

work page 2023
[40]

2022 , eprint=

Monolith: Real Time Recommendation System With Collisionless Embedding Table , author=. 2022 , eprint=

work page 2022
[41]

Proceedings of the 10th ACM Conference on Recommender Systems , pages =

Covington, Paul and Adams, Jay and Sargin, Emre , title =. Proceedings of the 10th ACM Conference on Recommender Systems , pages =. 2016 , isbn =

work page 2016
[42]

Proceedings of the 13th ACM Conference on Recommender Systems , pages=

Sampling-bias-corrected neural modeling for large corpus item recommendations , author=. Proceedings of the 13th ACM Conference on Recommender Systems , pages=

work page
[43]

9th International Conference on Learning Representations,

Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby , title =. 9th International Conference on Learning Representations,. 2021 , url =

work page 2021
[44]

2020 , eprint=

Superbloom: Bloom filter meets Transformer , author=. 2020 , eprint=

work page 2020
[46]

Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , pages =

Wang, Jizhe and Huang, Pipei and Zhao, Huan and Zhang, Zhibo and Zhao, Binqiang and Lee, Dik Lun , title =. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , pages =. 2018 , isbn =

work page 2018
[47]

Proceedings of the 2018 World Wide Web Conference , pages =

Eksombatchai, Chantat and Jindal, Pranav and Liu, Jerry Zitao and Liu, Yuchen and Sharma, Rahul and Sugnet, Charles and Ulrich, Mark and Leskovec, Jure , title =. Proceedings of the 2018 World Wide Web Conference , pages =. 2018 , isbn =

work page 2018
[48]

Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , pages =

Zhu, Han and Li, Xiang and Zhang, Pengye and Li, Guozheng and He, Jie and Li, Han and Gai, Kun , title =. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , pages =. 2018 , isbn =

work page 2018
[49]

Proceedings of the 37th International Conference on Machine Learning , articleno =

Zhuo, Jingwei and Xu, Ziru and Dai, Wei and Zhu, Han and Li, Han and Xu, Jian and Gai, Kun , title =. Proceedings of the 37th International Conference on Machine Learning , articleno =. 2020 , publisher =

work page 2020
[50]

Proceedings of the 30th ACM International Conference on Information and Knowledge Management , pages =

Gao, Weihao and Fan, Xiangjun and Wang, Chong and Sun, Jiankai and Jia, Kai and Xiao, Wenzi and Ding, Ruofan and Bin, Xingyan and Yang, Hui and Liu, Xiaobing , title =. Proceedings of the 30th ACM International Conference on Information and Knowledge Management , pages =. 2021 , isbn =

work page 2021
[51]

Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining , pages =

Zhou, Chang and Ma, Jianxin and Zhang, Jianwei and Zhou, Jingren and Yang, Hongxia , title =. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining , pages =. 2021 , isbn =

work page 2021
[52]

Proceedings of the 37th International Conference on Machine Learning , articleno =

Xiong, Ruibin and Yang, Yunchang and He, Di and Zheng, Kai and Zheng, Shuxin and Xing, Chen and Zhang, Huishuai and Lan, Yanyan and Wang, Liwei and Liu, Tie-Yan , title =. Proceedings of the 37th International Conference on Machine Learning , articleno =. 2020 , publisher =

work page 2020
[53]

Fourteenth ACM Conference on Recommender Systems (RecSys'20) , pages =

Rendle, Steffen and Krichene, Walid and Zhang, Li and Anderson, John , title =. Fourteenth ACM Conference on Recommender Systems (RecSys'20) , pages =. 2020 , isbn =

work page 2020
[54]

arXiv preprint arXiv:2306.04039 , year=

Revisiting Neural Retrieval on Accelerators , author=. arXiv preprint arXiv:2306.04039 , year=

work page arXiv
[56]

Proceedings of the 15th ACM Conference on Recommender Systems , pages =

A Case Study on Sampling Strategies for Evaluating Neural Sequential Item Recommendation Models , author =. Proceedings of the 15th ACM Conference on Recommender Systems , pages =. 2021 , isbn =

work page 2021
[58]

Session-based Recommendations with Recurrent Neural Networks , booktitle =

Bal. Session-based Recommendations with Recurrent Neural Networks , booktitle =. 2016 , url =

work page 2016
[59]

Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval , pages =

Liu, Langming and Cai, Liu and Zhang, Chi and Zhao, Xiangyu and Gao, Jingtong and Wang, Wanyu and Lv, Yifu and Fan, Wenqi and Wang, Yiqi and He, Ming and Liu, Zitao and Li, Qing , title =. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval , pages =. 2023 , isbn =. doi:10.1145/3539618.3591717 , ...

work page doi:10.1145/3539618.3591717 2023
[60]

Proceedings of the 37th International Conference on Machine Learning , articleno =

Chen, Ting and Kornblith, Simon and Norouzi, Mohammad and Hinton, Geoffrey , title =. Proceedings of the 37th International Conference on Machine Learning , articleno =. 2020 , abstract =

work page 2020
[61]

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis , articleno =

Rajbhandari, Samyam and Rasley, Jeff and Ruwase, Olatunji and He, Yuxiong , title =. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis , articleno =. 2020 , isbn =

work page 2020
[62]

2022 , eprint=

Reducing Activation Recomputation in Large Transformer Models , author=. 2022 , eprint=

work page 2022
[63]

Factorization Machines , year=

Rendle, Steffen , booktitle=. Factorization Machines , year=

work page
[64]

Proceedings of the 26th International Joint Conference on Artificial Intelligence , pages =

Guo, Huifeng and Tang, Ruiming and Ye, Yunming and Li, Zhenguo and He, Xiuqiang , title =. Proceedings of the 26th International Joint Conference on Artificial Intelligence , pages =. 2017 , isbn =

work page 2017
[65]

Proceedings of the 26th International Joint Conference on Artificial Intelligence , pages =

Xiao, Jun and Ye, Hao and He, Xiangnan and Zhang, Hanwang and Wu, Fei and Chua, Tat-Seng , title =. Proceedings of the 26th International Joint Conference on Artificial Intelligence , pages =. 2017 , isbn =

work page 2017
[66]

Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 , pages =

Gillenwater, Jennifer and Kulesza, Alex and Fox, Emily and Taskar, Ben , title =. Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 , pages =. 2014 , publisher =

work page 2014
[69]

2023 , eprint=

XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models , author=. 2023 , eprint=

work page 2023
[70]

Getting Deep Recommenders Fit: Bloom Embeddings for Sparse Binary Input/Output Networks , year =

Serr\`. Getting Deep Recommenders Fit: Bloom Embeddings for Sparse Binary Input/Output Networks , year =. Proceedings of the Eleventh ACM Conference on Recommender Systems , pages =. doi:10.1145/3109859.3109876 , abstract =

work page doi:10.1145/3109859.3109876
[72]

Large Language Models are Zero-Shot Rankers for Recommender Systems , booktitle =

Yupeng Hou and Junjie Zhang and Zihan Lin and Hongyu Lu and Ruobing Xie and Julian McAuley and Wayne Xin Zhao , year=. Large Language Models are Zero-Shot Rankers for Recommender Systems , booktitle =. 2305.08845 , archivePrefix=

work page arXiv
[73]

2023 , eprint=

RoFormer: Enhanced Transformer with Rotary Position Embedding , author=. 2023 , eprint=

work page 2023
[74]

2020 , eprint=

GLU Variants Improve Transformer , author=. 2020 , eprint=

work page 2020
[75]

CoRR , volume =

Zeyu Cui and Jianxin Ma and Chang Zhou and Jingren Zhou and Hongxia Yang , title =. CoRR , volume =. 2022 , url =. doi:10.48550/arXiv.2205.08084 , eprinttype =. 2205.08084 , timestamp =

work page doi:10.48550/arxiv.2205.08084 2022
[76]

2018 International Conference on Data Mining (ICDM) , pages=

Self-attentive sequential recommendation , author=. 2018 International Conference on Data Mining (ICDM) , pages=

work page 2018
[77]

Proceedings of the 28th ACM International Conference on Information and Knowledge Management , pages =

Sun, Fei and Liu, Jun and Wu, Jian and Pei, Changhua and Lin, Xiao and Ou, Wenwu and Jiang, Peng , title =. Proceedings of the 28th ACM International Conference on Information and Knowledge Management , pages =. 2019 , isbn =

work page 2019
[81]

Transformer Memory as a Differentiable Search Index , volume =

Tay, Yi and Tran, Vinh and Dehghani, Mostafa and Ni, Jianmo and Bahri, Dara and Mehta, Harsh and Qin, Zhen and Hui, Kai and Zhao, Zhe and Gupta, Jai and Schuster, Tal and Cohen, William W and Metzler, Donald , booktitle =. Transformer Memory as a Differentiable Search Index , volume =

work page
[83]

and Bengio, Samy and Weston, Jason , title =

Gupta, Maya R. and Bengio, Samy and Weston, Jason , title =. J. Mach. Learn. Res. , month =. 2014 , issue_date =

work page 2014
[87]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Scaling Data-Constrained Language Models , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

work page
[88]

Text is all you need: Learning language representations for sequential recommendation

Jiacheng Li and Ming Wang and Jin Li and Jinmiao Fu and Xin Shen and Jingbo Shang and Julian McAuley. Text is all you need: Learning language representations for sequential recommendation. KDD. 2023

work page 2023
[90]

2023 , eprint=

A Bi-Step Grounding Paradigm for Large Language Models in Recommendation Systems , author=. 2023 , eprint=

work page 2023
[91]

2022 , eprint=

M6-Rec: Generative Pretrained Language Models are Open-Ended Recommender Systems , author=. 2022 , eprint=

work page 2022
[93]

Dynamically Scaled RoPE further increases performance of long context LLaMA with zero fine-tuning , author=

work page
[94]

Tallrec: An effective and efficient tuning framework to align large language model with recommendation

Bao, K., Zhang, J., Zhang, Y., Wang, W., Feng, F., and He, X. Tallrec: An effective and efficient tuning framework to align large language model with recommendation. In Proceedings of the 17th ACM Conference on Recommender Systems, RecSys ’23. ACM, September 2023. doi:10.1145/3604915.3608857. URL http://dx.doi.org/10.1145/3604915.3608857

work page doi:10.1145/3604915.3608857 2023
[95]

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, ...

work page 2020
[96]

Twin: Two-stage interest network for lifelong user behavior modeling in ctr prediction at kuaishou, 2023

Chang, J., Zhang, C., Fu, Z., Zang, X., Guan, L., Lu, J., Hui, Y., Leng, D., Niu, Y., Song, Y., and Gai, K. Twin: Two-stage interest network for lifelong user behavior modeling in ctr prediction at kuaishou, 2023

work page 2023
[97]

Behavior sequence transformer for e-commerce recommendation in alibaba

Chen, Q., Zhao, H., Li, W., Huang, P., and Ou, W. Behavior sequence transformer for e-commerce recommendation in alibaba. In Proceedings of the 1st International Workshop on Deep Learning Practice for High-Dimensional Sparse Data, DLP-KDD '19, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450367837. doi:10.1145/3326937.3341261. UR...

work page doi:10.1145/3326937.3341261 2019
[98]

A simple framework for contrastive learning of visual representations

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, ICML'20, 2020

work page 2020
[99]

Wide & deep learning for recommender systems

Cheng, H.-T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., Anderson, G., Corrado, G., Chai, W., Ispir, M., Anil, R., Haque, Z., Hong, L., Jain, V., Liu, X., and Shah, H. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, DLRS 2016, pp.\ 7–10, 2016. ISBN 9781450347952

work page 2016
[100]

Generating Long Sequences with Sparse Transformers

Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers. CoRR, abs/1904.10509, 2019. URL http://arxiv.org/abs/1904.10509

work page internal anchor Pith review Pith/arXiv arXiv 1904

Showing first 80 references.