On the Memorization Behavior of LLMs in Generative Recommendation: Observations, Implications, and Training Strategies

Bhuvesh Kumar; Clark Mingxuan Ju; Donald Loveland; Kijung Shin; Liam Collins; Neil Shah; Sunkyung Lee; Sunwoo Kim

arxiv: 2606.17276 · v3 · pith:UR6WF6GMnew · submitted 2026-06-15 · 💻 cs.IR · cs.LG

Sunwoo Kim, Sunkyung Lee, Clark Mingxuan Ju, Donald Loveland, Bhuvesh Kumar, Kijung Shin, Neil Shah, Liam Collins

Pith reviewed 2026-06-27 02:27 UTC · model grok-4.3

classification 💻 cs.IR cs.LG

keywords generative recommendation · large language models · memorization · one-hop transitions · item-item relations · training strategies · multi-hop co-occurrences

The pith

LLMs in generative recommendation achieve most of their gains over baselines through one-hop memorization of training sequences

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how large language models applied to generative recommendation rely on one-hop memorization, meaning they recommend items that directly follow items seen in training sequences. This behavior occurs more frequently in LLMs than in traditional non-LLM models and explains the bulk of their measured improvements. The authors argue that further progress depends on moving beyond single-step transitions. They introduce IIRG, a training strategy that supplies the model with multi-hop co-occurrence patterns and semantic similarities between items. Experiments show IIRG raises accuracy, particularly for users whose target items lack any one-hop connection in the training data.
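The definition above can be made concrete with a small sketch. It assumes that "one-hop" means an item appears immediately after another item in at least one training sequence, and that a recommendation counts as memorized when it directly follows some item in the user's history. The function names and toy data are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def build_successors(train_sequences):
    """Map each item to the set of items that directly follow it in training."""
    successors = defaultdict(set)
    for seq in train_sequences:
        for prev_item, next_item in zip(seq, seq[1:]):
            successors[prev_item].add(next_item)
    return successors


def one_hop_ratio(history, recommendations, successors):
    """Fraction of recommended items that directly follow some history item."""
    if not recommendations:
        return 0.0
    reachable = set()
    for item in history:
        reachable |= successors.get(item, set())
    hits = sum(1 for rec in recommendations if rec in reachable)
    return hits / len(recommendations)


# Toy data (hypothetical), loosely echoing the soccer example in Figure 1.
train_sequences = [
    ["ball", "shirt", "bottle"],
    ["cleats", "socks"],
    ["ball", "cleats"],
]
successors = build_successors(train_sequences)
ratio = one_hop_ratio(["ball"], ["shirt", "socks", "cleats", "tent"], successors)
print(ratio)  # 0.5: "shirt" and "cleats" directly follow "ball" in training
```

A ratio like this, averaged over users and computed on each model's top-k list, is one plausible way to compare how heavily two models lean on direct transitions.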

Core claim

LLMs perform one-hop memorization more than non-LLM generative recommendation models, and the vast majority of their gains over baselines occur on users whose target items can be reached via one-hop transitions from training data. Teaching LLMs richer item-item relations through IIRG, which incorporates collaborative signals from multi-hop co-occurrences across user sequences and semantic relations among thematically similar items, improves results over standard next-item prediction training, with the largest lifts on users outside one-hop coverage.
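To see what "gains concentrated on one-hop users" would look like in practice, the sketch below splits users by whether their target item is a direct training successor of something in their history, then compares hit rates of two models within each group. The coverage rule, the hit-rate metric, and all data are illustrative assumptions; the paper's exact protocol is not reproduced here.

```python
def is_one_hop_covered(history, target, successors):
    """True if the target directly follows any history item in training."""
    return any(target in successors.get(item, set()) for item in history)


def hit_rate_by_group(users, recs, successors, k=5):
    """users: list of (history, target); recs: per-user ranked lists."""
    hits = {"covered": 0, "uncovered": 0}
    counts = {"covered": 0, "uncovered": 0}
    for (history, target), ranked in zip(users, recs):
        covered = is_one_hop_covered(history, target, successors)
        group = "covered" if covered else "uncovered"
        counts[group] += 1
        hits[group] += int(target in ranked[:k])
    return {g: hits[g] / counts[g] if counts[g] else 0.0 for g in counts}


# Hypothetical one-hop transitions and test users.
successors = {"ball": {"shirt"}, "cleats": {"socks"}}
users = [
    (["ball"], "shirt"),
    (["cleats"], "socks"),
    (["ball"], "tent"),
    (["tent"], "stove"),
]
baseline_recs = [["tent"], ["socks"], ["shirt"], ["stove"]]
llm_recs = [["shirt"], ["socks"], ["shirt"], ["stove"]]

baseline = hit_rate_by_group(users, baseline_recs, successors)
llm = hit_rate_by_group(users, llm_recs, successors)
gain = {g: llm[g] - baseline[g] for g in llm}
print(gain)  # {'covered': 0.5, 'uncovered': 0.0}
```

In this toy case the whole improvement lands in the covered group, which is the pattern the core claim describes.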

What carries the argument

IIRG, a training strategy that adds collaborative relations from multi-hop item co-occurrences and semantic relations among similar items as additional signals during fine-tuning
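A minimal sketch of how the two kinds of neighbors might be mined is given below. It assumes collaborative neighbors are items that co-occur within a few positions of each other in a user sequence (counted symmetrically) and that semantic neighbors are nearest items by cosine similarity of some item embedding. The window size, the symmetry, the embedding source, and all names are assumptions for illustration, not the authors' construction.

```python
import math
from collections import Counter, defaultdict


def collaborative_neighbors(sequences, max_hops=3, top_k=5):
    """Rank items by how often they co-occur within max_hops positions."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i, item in enumerate(seq):
            for other in seq[i + 1 : i + 1 + max_hops]:
                if other != item:
                    counts[item][other] += 1
                    counts[other][item] += 1  # assumption: symmetric relation
    return {item: [n for n, _ in c.most_common(top_k)] for item, c in counts.items()}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_neighbors(embeddings, top_k=5):
    """Rank other items by cosine similarity of their embeddings."""
    neighbors = {}
    for item, vec in embeddings.items():
        scored = [(cosine(vec, other_vec), other)
                  for other, other_vec in embeddings.items() if other != item]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        neighbors[item] = [other for _, other in scored[:top_k]]
    return neighbors


# Hypothetical data: "socks" never directly follows "ball", yet co-occurs twice.
sequences = [["ball", "shirt", "socks"], ["ball", "cleats", "socks"]]
collab = collaborative_neighbors(sequences, max_hops=2, top_k=1)
embeddings = {"ball": [1.0, 0.0], "cleats": [0.9, 0.1], "tent": [0.0, 1.0]}
semantic = semantic_neighbors(embeddings, top_k=1)
print(collab["ball"], semantic["ball"])  # ['socks'] ['cleats']
```

Here the top collaborative neighbor of "ball" is an item reachable only in two hops, which is the kind of relation a next-item objective alone would not present as a direct target.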

If this is right

Standard next-item prediction training causes LLMs to favor one-hop memorization over broader generalization.
IIRG produces higher accuracy than next-item prediction alone, especially on users whose test items lack one-hop predecessors in training.
Multi-hop co-occurrence and semantic signals can be added to LLM training without changing model architecture.
Performance gaps on non-memorizable cases can be narrowed by exposing the model to these additional item relations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same one-hop bias may limit LLM performance in other sequential prediction tasks outside recommendation.
IIRG-style signals could be tested as a lightweight addition to existing fine-tuning pipelines for sequential models.
Measuring how IIRG affects memorization on longer sequences or in cross-domain settings would clarify its scope.

Load-bearing premise

One-hop memorization can be isolated as the dominant source of LLM performance gains without other pretrained knowledge or modeling factors explaining the same improvements.

What would settle it

Retraining the same LLMs on data with all one-hop transitions removed and finding that their advantage over non-LLM baselines remains would undermine the claim that one-hop memorization drives most gains.

Figures

Figures reproduced from arXiv: 2606.17276 by Bhuvesh Kumar, Clark Mingxuan Ju, Donald Loveland, Kijung Shin, Liam Collins, Neil Shah, Sunkyung Lee, Sunwoo Kim.

**Figure 1.** An example of one-hop memorization. The LLM memorizes that soccer shirts and socks are purchased after soccer balls and cleats, respectively (red arrows), then recommends the latter items to users who purchased the former.

**Figure 2.** One-hop memorization in Sports. LLMs using SIDs trained solely with next-item prediction (Naive) rely more on one-hop memorization than TIGER. This reliance leads to gains concentrated among users covered by such memorization, with limited benefits for the remaining users. IIRG, our training method, reduces this reliance and delivers stronger performance gains for these remaining users. Details are in § 3…

**Figure 3.** (a) LLM-based GR models consistently exhibit a higher one-hop memorization ratio in their top-5 recommendations than TIGER, the non-LLM-based GR baseline. This trend holds for both TIDs and SIDs. This result suggests that LLMs rely more heavily on one-hop memorization than the non-LLM-based GR model. Moreover, this tendency remains even when we (1) apply regularizations to LLMs or (2) scale TIGER to an L…

**Figure 4.** Overview of IIRG. To train LLM-based GR models, IIRG uses three tasks and jointly optimizes them: (1) next-item prediction (red box), (2) collaborative neighbor generation for each item (green box), and (3) semantic neighbor generation for each item (blue box).

**Figure 5.** Generalization to SIDs. IIRG remains effective under SIDs (IIRG-SID), improving over the LLM trained only with next-item prediction (Naive-SID) and outperforming the strongest SID-based baseline (LC-Rec-SID).

**Figure 6.** Achievement of design goal. IIRG reduces one-hop memorization in Naive, an LLM trained solely with next-item prediction (a), and yields larger percentage gains for non-one-hop-memorization-benefiting users than for one-hop-memorization-benefiting users (b). These trends hold across datasets and ID types.

**Figure 7.** Training sample of next-item prediction with term IDs. Given the instruction and input, the LLM is trained to generate the output autoregressively using teacher forcing. Instruction: Given a user’s historical item interaction sequence, predict the next item the user is most likely to interact with. Each item in the sequence is represented by a unique identifier composed of 3 special tokens enclosed in squa…

**Figure 8.** Training sample of next-item prediction with semantic IDs. Given the instruction and input, the LLM is trained to generate the output autoregressively using teacher forcing.

**Figure 9.** Training sample of collaborative neighbor generation with term IDs. Given the instruction and input, the LLM is trained to generate the output autoregressively using teacher forcing. Instruction: Given a target item in the format [keywords, title], recommend five items that are most likely to be copurchased with it. \n Return the items sorted by likelihood, from most likely to least likely, and format eac…

**Figure 10.** Training sample of collaborative neighbor generation with semantic IDs. Given the instruction and input, the LLM is trained to generate the output autoregressively using teacher forcing.

**Figure 11.** Training sample of semantic neighbor generation with term IDs. Given the instruction and input, the LLM is trained to generate the output autoregressively using teacher forcing. Instruction: Given a target item in the format [keywords, title], list five items that are most semantically similar to it. \n Return the items sorted by similarity, from most similar to least similar, and format each item as [key…

**Figure 12.** Training sample of semantic neighbor generation with semantic IDs. Given the instruction and input, the LLM is trained to generate the output autoregressively using teacher forcing.

**Figure 13.** Informativeness of our proposed neighbors. The ratio of users benefiting from our collaborative and semantic neighbors is comparable to that of users benefiting from one-hop memorization (see (a)). Moreover, the three neighbor types have a Jaccard similarity below 0.5, suggesting that they are not merely redundant (see (b)). In addition, among non-memorization-benefiting users, our neighbors benefit a gro…

**Figure 14.** Effect of existing regularization techniques. In the Beauty dataset, existing regularization techniques cannot effectively reduce the LLM’s one-hop memorization behavior and/or hurt the LLM’s recommendation performance.

**Figure 15.** Effect of model size on memorization. In the Sports dataset, the size of a GR model does not show a clear correlation with its reliance on one-hop memorization, for either the non-LLM-based model, TIGER, or LLM-based models. This suggests that model size alone does not determine memorization behavior.

**Figure 16.** Performance of diverse semantic neighbor search methods for IIRG. Searching semantic neighbors with semantic IDs (SIDs) still significantly improves the recommendation performance of the LLM trained solely with next-item prediction (Naive), while the performance drop compared with full dense embedding search (Emb) remains relatively small.

**Figure 17.** In-depth analysis regarding warm- and cold-start items. The gain of IIRG over the LLM trained solely with next-item prediction (Naive) is consistently larger for users whose test items have limited user interactions than for users whose test items have rich user interactions.

**Figure 18.** Generalization to other semantic IDs (SIDs). The observation that LLMs rely more heavily on one-hop memorization than TIGER, as well as the effectiveness of IIRG in reducing memorization and improving recommendation performance, also holds under Residual-K-Means (R-K-Means) SIDs.

Original abstract

Generative recommendation (GR) has emerged as a promising direction for recommender systems. Recently, large language models (LLMs) have been increasingly adopted for GR, as their rich pretrained knowledge is expected to help them generalize beyond common user behavior patterns that traditional memorization-oriented baselines can capture. However, existing LLM-based GR works largely ignore LLMs' well-known tendency to memorize, which, if present in LLMs fine-tuned for GR, would restrict their utilization of pretrained knowledge. In this work, we investigate this concern by examining one-hop memorization, where a model recommends items that are direct successors of items in the training data. We show that LLMs do this more than non-LLM-based GR models; in fact, the vast majority of their gains over GR baselines are actually on users whose target items can be predicted through one-hop memorization. We intuit that improving performance on the remaining users requires LLMs to learn richer item-item relations beyond one-hop transitions. To achieve this, we propose IIRG, a novel training strategy that teaches LLMs to capture: (1) collaborative relations derived from item co-occurrences across multiple hops in user sequences, and (2) semantic relations among items with similar themes, both of which can serve as useful recommendation signals. We show that IIRG significantly improves over LLMs trained solely with standard next-item prediction, with especially large gains for users whose test items are not covered by train-time one-hop transitions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LLMs in generative rec mostly beat baselines via one-hop memorization of train successors, and IIRG tries to add multi-hop and semantic signals, but the user split risks popularity confounds.

The main thing to know is that this paper claims the bulk of LLM gains in generative recommendation come from one-hop memorization rather than broader generalization, and it introduces IIRG to train on multi-hop co-occurrences plus semantic item relations instead.

The one-hop dominance observation and the IIRG objective appear new relative to the cited GR and memorization literature. The paper does a solid job naming a concrete training issue and offering a targeted fix that directly addresses the users left behind by standard next-item prediction.

The soft spot is the user partition itself. The central claim needs the split between one-hop-covered users and everyone else to cleanly isolate memorization, yet nothing in the abstract shows controls for item popularity, user activity, or other factors that could drive both the transitions and LLM performance. Popular items are likelier to appear in one-hop pairs, so the two groups may not be comparable, which makes the attribution to memorization less secure. The IIRG gains on the harder users also need checks that the new signals do not simply create fresh train-test overlaps.
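One way to probe the popularity concern is to compare models within popularity strata rather than across the whole user base. The sketch below buckets target items by their training frequency and reports hit rates per (bucket, covered) cell. The rank-based bucketing and the input layout are assumptions chosen for brevity, not a control the paper is known to run.

```python
from collections import Counter, defaultdict


def popularity_buckets(train_sequences, n_buckets=4):
    """Assign each training item a rank-based bucket (0 = least popular)."""
    counts = Counter(item for seq in train_sequences for item in seq)
    ranked = sorted(counts, key=lambda item: (counts[item], item))
    return {item: rank * n_buckets // len(ranked) for rank, item in enumerate(ranked)}


def stratified_hit_rate(users, buckets):
    """users: dicts with 'target', 'covered', 'hit'. Keys: (bucket, covered)."""
    totals = defaultdict(lambda: [0, 0])  # [hits, users] per cell
    for user in users:
        # Bucket -1 collects targets never seen in training.
        key = (buckets.get(user["target"], -1), user["covered"])
        totals[key][0] += int(user["hit"])
        totals[key][1] += 1
    return {key: hits / n for key, (hits, n) in totals.items()}


# Hypothetical training sequences and per-user evaluation records.
train_sequences = [["a", "b"], ["a", "c"], ["a", "b", "d"]]
buckets = popularity_buckets(train_sequences, n_buckets=2)
users = [
    {"target": "a", "covered": True, "hit": True},
    {"target": "b", "covered": True, "hit": False},
    {"target": "c", "covered": False, "hit": True},
    {"target": "d", "covered": True, "hit": True},
]
rates = stratified_hit_rate(users, buckets)
print(rates[(1, True)])  # 0.5 for popular, one-hop-covered targets
```

If the covered-versus-uncovered gap shrinks sharply inside each bucket, popularity rather than memorization would be the more likely driver.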

This is for researchers working on LLM-based recommenders. Readers in that area will find the diagnostic and the training change useful to test. The work has a clear empirical question and a reproducible method, so it deserves a serious referee.

I would send it to peer review with requests for partition controls and full result details.

Referee Report

2 major / 1 minor

Summary. The paper investigates one-hop memorization in LLM-based generative recommendation, where models recommend items that are direct successors in training sequences. It claims LLMs exhibit this behavior more than non-LLM GR baselines, with the vast majority of performance gains occurring for users whose test items are covered by such one-hop transitions. The authors propose IIRG, a training strategy that augments next-item prediction with multi-hop collaborative co-occurrences and semantic item relations, reporting significant improvements over standard fine-tuning, especially on users whose test items lack one-hop coverage.

Significance. If the empirical results hold after addressing potential confounds, the work identifies a key limitation in current LLM-GR approaches and supplies a concrete training method to encourage richer item-item relations. This could shift practice toward more generalizable LLM recommenders and prompt further study of memorization versus generalization trade-offs in sequential recommendation.

major comments (2)

[experimental analysis of one-hop memorization] The user partitioning into one-hop versus non-one-hop groups (described in the experimental analysis of memorization behavior) does not report stratification, matching, or ablation on item popularity, user activity level, or semantic category. This partition is load-bearing for the central claim that LLM gains over GR baselines are driven by one-hop memorization; without such controls the attribution remains vulnerable to confounding.
[IIRG training strategy and evaluation] The evaluation of IIRG gains on non-one-hop users assumes the added multi-hop and semantic signals improve generalization without new overfitting to training co-occurrences, yet no analysis of train/test overlap on the constructed IIRG signals or ablation removing high-frequency items is provided.

minor comments (1)

[Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., relative improvement on non-one-hop users) and the primary datasets used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below and outline the revisions we will make to strengthen the empirical support for our claims.

Referee: [experimental analysis of one-hop memorization] The user partitioning into one-hop versus non-one-hop groups (described in the experimental analysis of memorization behavior) does not report stratification, matching, or ablation on item popularity, user activity level, or semantic category. This partition is load-bearing for the central claim that LLM gains over GR baselines are driven by one-hop memorization; without such controls the attribution remains vulnerable to confounding.

Authors: We acknowledge that explicit controls for item popularity, user activity, and semantic category would further isolate the role of one-hop memorization. While the head-to-head comparison with non-LLM GR baselines (trained on identical data) already holds many distributional factors constant, we agree that stratification strengthens the attribution. In the revision we will add results stratified by item popularity quartiles and user activity levels, plus a brief discussion of semantic category balance. These additional tables will show that the concentration of LLM gains on one-hop users persists across strata. revision: yes

Referee: [IIRG training strategy and evaluation] The evaluation of IIRG gains on non-one-hop users assumes the added multi-hop and semantic signals improve generalization without new overfitting to training co-occurrences, yet no analysis of train/test overlap on the constructed IIRG signals or ablation removing high-frequency items is provided.

Authors: We agree that verifying the generalization of the IIRG signals is necessary. In the revised manuscript we will report the fraction of multi-hop co-occurrences and semantic relations that appear in the test set, and we will add an ablation that removes or down-weights high-frequency items from the IIRG objective. These analyses will demonstrate that the reported gains on non-one-hop users are not driven by leakage or frequency bias. revision: yes
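The overlap analysis promised in the second response could take several forms. As one hedged reading, the sketch below measures what fraction of the (item, neighbor) pairs used as training signals coincide with a (history item, test target) pair. The pair-level definition of overlap and the data layout are assumptions, and a nonzero value is not by itself evidence of leakage, since neighbors mined from training data are expected to predict some test targets.

```python
def neighbor_test_overlap(neighbors, test_users):
    """Share of (item, neighbor) signal pairs that match a test transition.

    neighbors: item -> list of neighbor items built from training data.
    test_users: list of (history, target) pairs from the test split.
    """
    test_pairs = {(item, target) for history, target in test_users for item in history}
    signal_pairs = {(item, n) for item, ns in neighbors.items() for n in ns}
    if not signal_pairs:
        return 0.0
    return len(signal_pairs & test_pairs) / len(signal_pairs)


# Hypothetical neighbor lists and test users.
neighbors = {"ball": ["socks", "shirt"], "tent": ["stove"]}
test_users = [(["ball"], "socks"), (["tent", "ball"], "lamp")]
overlap = neighbor_test_overlap(neighbors, test_users)
print(round(overlap, 3))  # 0.333: one of three signal pairs matches a test pair
```

Reporting this number separately for collaborative and semantic neighbors, and again after dropping high-frequency items, would cover both checks the referee asks for.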

Circularity Check

0 steps flagged

No circularity: empirical observational study with no self-referential derivations or fitted predictions.

The paper presents an empirical investigation of LLM memorization in generative recommendation via user partitioning on one-hop transitions, followed by a proposed training strategy (IIRG) evaluated through experiments. No mathematical derivation chain, equations, or first-principles results are claimed that reduce to inputs by construction. Claims rest on experimental comparisons (LLM vs. GR baselines, IIRG vs. standard next-item prediction) rather than quantities defined inside the paper or self-citations that bear the central load. The one-hop partition and IIRG signals are defined externally to the results and tested on held-out data, making the work self-contained against external benchmarks with no patterns matching self-definitional, fitted-input, or ansatz-smuggling circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard supervised fine-tuning assumptions plus the implicit modeling choice that one-hop transitions are the primary memorization mechanism worth isolating; no new mathematical axioms or invented physical entities are introduced.

axioms (1)

Domain assumption: Standard next-item prediction loss is the baseline training objective for GR.
Invoked when contrasting IIRG against 'LLMs trained solely with standard next-item prediction'.

Reference graph

Works this paper leans on

63 extracted references · 8 linked inside Pith

[1] Code and datasets of this work.
[2] Generative large recommendation models: emerging trends in llms for recommendation. WWW.
[3] Generative Recommendation with Semantic IDs: A Practitioner's Handbook. CIKM.
[4] Recommender systems with generative retrieval. NeurIPS.
[5] Minionerec: An open-source framework for scaling generative recommendation. arXiv preprint arXiv:2510.24431.
[6] Gram: Generative recommendation via semantic-aware multi-granular late fusion. ACL.
[7] Plum: Adapting pre-trained language models for industrial-scale generative recommendations. WWW.
[8] Unleashing the Native Recommendation Potential: LLM-Based Generative Recommendation via Structured Term Identifiers. arXiv preprint arXiv:2601.06798.
[9] OpenOneRec Technical Report. arXiv preprint arXiv:2512.24762.
[10] Adapting large language models by integrating collaborative semantics for recommendation. ICDE.
[11] Eager-llm: Enhancing large language models as recommenders through exogenous behavior-semantic integration. WWW.
[12] Recommendation as language processing (rlp): A unified pretrain, personalized prompt & predict paradigm (p5). RecSys.
[13] Aligning large language models with recommendation knowledge. NAACL.
[14] Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
[15] Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
[16] Generative explore-exploit: Training-free optimization of generative recommender systems using llm optimizers. ACL.
[17] LOHRec: Leveraging Order and Hierarchy in Generative Sequential Recommendation. EMNLP.
[18] Attention is all you need. NeurIPS.
[19] Large language models for generative recommendation: A survey and visionary discussions. LREC-COLING.
[20] Semantic ids for joint generative search and recommendation. RecSys.
[21] Qwen3 technical report. arXiv preprint arXiv:2505.09388.
[22] Self-attentive sequential recommendation. ICDM.
[23] AgenticTagger: Structured Item Representation for Recommendation with LLM Agents. arXiv preprint arXiv:2602.05945.
[24] Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176.
[25] An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-Tuning. IEEE Transactions on Audio, Speech and Language Processing.
[26] Lora: Low-rank adaptation of large language models. ICLR.
[27] Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. WWW.
[28] Lightgcn: Simplifying and powering graph convolution network for recommendation. SIGIR.
[29] Are graph augmentations necessary? simple graph contrastive learning for recommendation. SIGIR.
[30] Feature-level deeper self-attention network for sequential recommendation. IJCAI.
[31] S3-rec: Self-supervised learning for sequential recommendation with mutual information maximization. CIKM.
[32] Learnable item tokenization for generative recommendation. CIKM.
[33] Autoregressive image generation using residual quantization. CVPR.
[34] Idgenrec: Llm-recsys alignment with textual id learning. SIGIR.
[35] Llamafactory: Unified efficient fine-tuning of 100+ language models. ACL.
[36] Decoupled weight decay regularization. ICLR.
[37] Learning transition patterns by large language models for sequential recommendation. COLING.
[38] Actionpiece: Contextually tokenizing action sequences for generative recommendation. ICML.
[39] Efficient inference for large language model-based generative recommendation. ICLR.
[40] Item-based collaborative filtering recommendation algorithms. WWW.
[41] Onerec-think: In-text reasoning for generative recommendation. arXiv preprint arXiv:2510.11639.
[42] Deploying Semantic ID-based Generative Retrieval for Large-Scale Podcast Discovery at Spotify. arXiv preprint arXiv:2603.17540.
[43] Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. KDD.
[44] Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
[45] Sequence to sequence learning with neural networks. NeurIPS.
[46] Beyond accuracy: evaluating recommender systems by coverage and serendipity. RecSys.
[47] Inferring networks of substitutable and complementary products. KDD.
[48] Align ^3 GR: Unified Multi-Level Alignment for LLM-based Generative Recommendation. AAAI.
[49] Understanding generative recommendation with semantic ids from a model-scaling view. arXiv preprint arXiv:2509.25522.
[50] Shortcut learning of large language models in natural language understanding. Communications of the ACM, 2023.
[51] Assessing robustness to spurious correlations in post-training language models. ICLR.
[52] Do llms overcome shortcut learning? an evaluation of shortcut challenges in large language models. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.
[53] Undesirable memorization in large language models: A survey. arXiv preprint arXiv:2410.02650.
[54] Exploring memorization in fine-tuned language models. ACL.
[55] Skewed memorization in large language models: Quantification and decomposition. arXiv preprint arXiv:2502.01187.
[56] Ranaldi, Leonardo and Ruzzetti, Elena Sofia and Zanzotto, Fabio Massimo Angelova, Galia.
[57] Foundation models and fair use. Journal of Machine Learning Research.
[58] Preventing generation of verbatim memorization in language models gives a false sense of privacy. INLG.
[59] Deduplicating training data makes language models better. ACL.
[60] The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
[61] Mixtral of experts. arXiv preprint arXiv:2401.04088.
[62] An empirical analysis of memorization in fine-tuned autoregressive language models. EMNLP.
[63] Preserving privacy through dememorization: An unlearning technique for mitigating memorization risks in language models. EMNLP.


[26] [26]

, author=

Lora: Low-rank adaptation of large language models. , author=. ICLR , year=

[27] [27]

WWW , year=

Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering , author=. WWW , year=

[28] [28]

SIGIR , year=

Lightgcn: Simplifying and powering graph convolution network for recommendation , author=. SIGIR , year=

[29] [29]

SIGIR , year=

Are graph augmentations necessary? simple graph contrastive learning for recommendation , author=. SIGIR , year=

[30] [30]

, author=

Feature-level deeper self-attention network for sequential recommendation. , author=. IJCAI , year=

[31] [31]

CIKM , year=

S3-rec: Self-supervised learning for sequential recommendation with mutual information maximization , author=. CIKM , year=

[32] [32]

CIKM , year=

Learnable item tokenization for generative recommendation , author=. CIKM , year=

[33] [33]

CVPR , year=

Autoregressive image generation using residual quantization , author=. CVPR , year=

[34] [34]

SIGIR , year=

Idgenrec: Llm-recsys alignment with textual id learning , author=. SIGIR , year=

[35] [35]

ACL , year=

Llamafactory: Unified efficient fine-tuning of 100+ language models , author=. ACL , year=

[36] [36]

ICLR , year=

Decoupled weight decay regularization , author=. ICLR , year=

[37] [37]

COLING , year=

Learning transition patterns by large language models for sequential recommendation , author=. COLING , year=

[38] [38]

ICML , year=

Actionpiece: Contextually tokenizing action sequences for generative recommendation , author=. ICML , year=

[39] [39]

ICLR , year=

Efficient inference for large language model-based generative recommendation , author=. ICLR , year=

[40] [40]

WWW , year=

Item-based collaborative filtering recommendation algorithms , author=. WWW , year=

[41] [41]

arXiv preprint arXiv:2510.11639 , year=

Onerec-think: In-text reasoning for generative recommendation , author=. arXiv preprint arXiv:2510.11639 , year=

arXiv

[42] [42]

arXiv preprint arXiv:2603.17540 , year=

Deploying Semantic ID-based Generative Retrieval for Large-Scale Podcast Discovery at Spotify , author=. arXiv preprint arXiv:2603.17540 , year=

arXiv

[43] [43]

KDD , year=

Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters , author=. KDD , year=

[44] [44]

arXiv preprint arXiv:2507.06261 , year=

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities , author=. arXiv preprint arXiv:2507.06261 , year=

Pith/arXiv arXiv

[45] [45]

NeurIPS , year=

Sequence to sequence learning with neural networks , author=. NeurIPS , year=

[46] [46]

RecSys , year=

Beyond accuracy: evaluating recommender systems by coverage and serendipity , author=. RecSys , year=

[47] [47]

KDD , year=

Inferring networks of substitutable and complementary products , author=. KDD , year=

[48] [48]

AAAI , year=

Align ^3 GR: Unified Multi-Level Alignment for LLM-based Generative Recommendation , author=. AAAI , year=

[49] [49]

arXiv preprint arXiv:2509.25522 , year=

Understanding generative recommendation with semantic ids from a model-scaling view , author=. arXiv preprint arXiv:2509.25522 , year=

Pith/arXiv arXiv

[50] [50]

Communications of the ACM , volume=

Shortcut learning of large language models in natural language understanding , author=. Communications of the ACM , volume=. 2023 , publisher=

2023

[51] [51]

ICLR , year=

Assessing robustness to spurious correlations in post-training language models , author=. ICLR , year=

[52] [52]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

Do llms overcome shortcut learning? an evaluation of shortcut challenges in large language models , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

2024

[53] [53]

arXiv preprint arXiv:2410.02650 , year=

Undesirable memorization in large language models: A survey , author=. arXiv preprint arXiv:2410.02650 , year=

arXiv

[54] [54]

ACL , year=

Exploring memorization in fine-tuned language models , author=. ACL , year=

[55] [55]

arXiv preprint arXiv:2502.01187 , year=

Skewed memorization in large language models: Quantification and decomposition , author=. arXiv preprint arXiv:2502.01187 , year=

arXiv

[56] [56]

Ranaldi, Leonardo and Ruzzetti, Elena Sofia and Zanzotto, Fabio Massimo Angelova, Galia , booktitle =

[57] [57]

Journal of Machine Learning Research , volume=

Foundation models and fair use , author=. Journal of Machine Learning Research , volume=

[58] [58]

INLG , year=

Preventing generation of verbatim memorization in language models gives a false sense of privacy , author=. INLG , year=

[59] [59]

ACL , year=

Deduplicating training data makes language models better , author=. ACL , year=

[60] [60]

arXiv preprint arXiv:2407.21783 , year=

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

Pith/arXiv arXiv

[61] [61]

arXiv preprint arXiv:2401.04088 , year=

Mixtral of experts , author=. arXiv preprint arXiv:2401.04088 , year=

Pith/arXiv arXiv

[62] [62]

EMNLP , year=

An empirical analysis of memorization in fine-tuned autoregressive language models , author=. EMNLP , year=

[63] [63]

EMNLP , year=

Preserving privacy through dememorization: An unlearning technique for mitigating memorization risks in language models , author=. EMNLP , year=