Diagnosing and Mitigating Retrieval Bottlenecks in LLM-Based Cold-Start Recommendation

Zhe Dong (1), Fang Qin (2), Manish Shah (3), Yicheng Wang (3) ((1) University of Maine at Presque Isle, (2) Stanford University, (3) Independent Researcher)

arxiv: 2606.29947 · v1 · pith:Q4IKQFD4new · submitted 2026-06-29 · 💻 cs.IR · cs.LG

Pith reviewed 2026-06-30 04:27 UTC · model grok-4.3

classification 💻 cs.IR cs.LG

keywords: cold-start recommendation · LLM rerankers · retrieval coverage · hybrid fusion · recommender systems · information retrieval · long-tail items

The pith

Retrieval coverage is the primary bottleneck for LLM-based cold-start recommendation, not reranker quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models improve cold-start recommendation when used as rerankers. It uses a five-domain benchmark that separates the quality of reranking from how well the initial retrieval finds the target item. In settings where the correct item is forced into the candidate pool, LLMs do not reliably beat strong baselines, even with larger models. In more realistic settings, standard retrievers rarely include the target item at all because many cold-start items are completely new. The authors introduce a hybrid fusion method for retrieval that improves coverage but still leaves LLM reranking underperforming compared to non-LLM approaches.

Core claim

In retrieval-realistic conditions, standard single retrievers place the gold item in a 200-item pool only 4.6-22.9% of the time, largely because 32-91% of cold-start targets are brand-new items with no training interactions. Even when the gold item is present, calibrated LLM rerankers fail to consistently outperform strong baselines. A learned hybrid fusion layer over multi-retriever pools improves coverage, but learned non-LLM ranking exploits that pool better than prompt-level LLM reranking does.
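The two statistics in this claim, gold-in-pool coverage at a 200-item cutoff and the share of brand-new targets, are simple to compute once candidate pools exist. The following is a minimal sketch, not the paper's evaluation tooling; the function names and the data layout (a dict of ranked pools per query, a dict of gold items, a set of training item ids) are assumptions made for illustration.

```python
from typing import Dict, Sequence, Set


def gold_in_pool_rate(pools: Dict[str, Sequence[str]],
                      gold: Dict[str, str],
                      k: int = 200) -> float:
    """Fraction of queries whose gold item appears in the top-k candidate pool."""
    if not gold:
        raise ValueError("no evaluation queries")
    # A query with no retrieved pool counts as a miss.
    hits = sum(1 for query, target in gold.items()
               if target in pools.get(query, ())[:k])
    return hits / len(gold)


def brand_new_rate(gold: Dict[str, str], train_items: Set[str]) -> float:
    """Fraction of gold items that never occur in the training interactions."""
    if not gold:
        raise ValueError("no evaluation queries")
    unseen = sum(1 for target in gold.values() if target not in train_items)
    return unseen / len(gold)
```

A retriever that only indexes items seen in training cannot return a brand-new target, so under this reading its coverage is bounded above by one minus the brand-new rate.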

What carries the argument

LHF, a validation-trained learned hybrid fusion layer over a multi-retriever union pool, which combines retrieval signals to increase the chance the gold item appears in the candidate set.
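To make the idea of a validation-trained fusion layer concrete, here is a hedged sketch of one way such a layer could look. It is not the paper's LHF: the features (a bias term plus one reciprocal-rank feature per retriever), the plain logistic-regression model, and every name in the code are assumptions chosen to keep the example self-contained.

```python
import math
from typing import Dict, List, Tuple

Rankings = Dict[str, List[str]]  # retriever name -> ranked item ids (hypothetical layout)


def union_features(rankings: Rankings, retrievers: List[str]) -> Dict[str, List[float]]:
    """Map each item in the union pool to [bias, 1/rank for each retriever]."""
    feats: Dict[str, List[float]] = {}
    for j, name in enumerate(retrievers):
        for rank, item in enumerate(rankings.get(name, []), start=1):
            row = feats.setdefault(item, [1.0] + [0.0] * len(retrievers))
            row[j + 1] = max(row[j + 1], 1.0 / rank)
    return feats


def _sigmoid(z: float) -> float:
    # Clamp the logit to avoid overflow in exp.
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))


def fit_fusion(val_queries: List[Tuple[Rankings, str]], retrievers: List[str],
               epochs: int = 200, lr: float = 0.5) -> List[float]:
    """Fit logistic-regression weights by SGD; the label is 1 when item == gold."""
    w = [0.0] * (len(retrievers) + 1)
    for _ in range(epochs):
        for rankings, gold in val_queries:
            for item, x in union_features(rankings, retrievers).items():
                y = 1.0 if item == gold else 0.0
                p = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
                for i, xi in enumerate(x):
                    w[i] += lr * (y - p) * xi
    return w


def fuse(rankings: Rankings, retrievers: List[str], w: List[float],
         k: int = 200) -> List[str]:
    """Score the union pool with the learned weights and keep the top k items."""
    feats = union_features(rankings, retrievers)
    ranked = sorted(feats, key=lambda it: -sum(wi * xi for wi, xi in zip(w, feats[it])))
    return ranked[:k]
```

The point of the sketch is the shape of the pipeline: pool the candidates from all retrievers, learn on validation data how much to trust each signal, then cut the fused ranking at the same pool size used for single retrievers.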

If this is right

LLM rerankers do not consistently beat collaborative and content baselines in positive-controlled regimes across five domains.
Single retrievers achieve low coverage of cold-start targets in realistic regimes.
LHF is the only tested combiner that beats every single retriever on all five domains; it recovers 17-61% of oracle coverage headroom on content-rich domains but only 5-7% on collaboratively strong domains.
End-to-end, non-LLM ranking on LHF pools outperforms LLM reranking on the same pools.
LLMs show semantic advantages mainly in text-rich domains when the item is already retrieved.
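The 17-61% figure for LHF is described in the abstract as a share of oracle coverage headroom. This page does not give the formula, so the sketch below assumes a common reading: the oracle counts a query as covered if any retriever has the gold item in its top-k, and headroom is the gap between that oracle and the best single retriever. Function names and data layout are hypothetical.

```python
from typing import Dict, Sequence


def oracle_coverage(pools_by_retriever: Dict[str, Dict[str, Sequence[str]]],
                    gold: Dict[str, str], k: int = 200) -> float:
    """Share of queries where at least one retriever has the gold item in its top-k."""
    hits = 0
    for query, target in gold.items():
        if any(target in pools.get(query, ())[:k]
               for pools in pools_by_retriever.values()):
            hits += 1
    return hits / len(gold)


def headroom_recovered(fused: float, best_single: float, oracle: float) -> float:
    """Assumed definition: (fused - best_single) / (oracle - best_single)."""
    headroom = oracle - best_single
    if headroom <= 0:
        return 0.0  # no headroom left to recover
    return (fused - best_single) / headroom
```

For example, with a best single retriever at 20% coverage, an oracle at 60%, and a fused pool at 30%, this definition gives 25% of headroom recovered.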

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Recommender pipelines may benefit more from investing in multi-retriever coverage than in LLM reranking layers.
Future work could test whether fine-tuning LLMs specifically for ranking on these pools closes the gap.
Domains with strong collaborative signals may need different retrieval strategies than text-rich ones.

Load-bearing premise

The positive-controlled and retrieval-realistic regimes on the five-domain benchmark isolate reranking performance from retrieval coverage without being affected by domain selection or prompt choices.

What would settle it

An experiment showing a single retriever achieving over 50% gold-item coverage in the retrieval-realistic regime on multiple domains, or an LLM reranker that consistently outperforms all baselines in the positive-controlled regime across domains.

Figures

Figures reproduced from arXiv: 2606.29947 by Zhe Dong (1), Fang Qin (2), Manish Shah (3), Yicheng Wang (3) ((1) University of Maine at Presque Isle, (2) Stanford University, (3) Independent Researcher).

**Figure 2.** Retriever complementarity is large, but difficult to realize. LHF is the only combiner that [...]

**Figure 3.** Coverage-aware training exposes a regime conflict. Upweighting item-new positives [...]

**Figure 4.** The LHF pool contains rankable signal, but the prompt-level LLM does not exploit it.

**Figure 5.** Candidate-generation Recall@K (gold-in-pool coverage at cutoff K) for realizable rankers.

**Figure 6.** LHF ablations. Removing text retrievers or cold-start metadata severely hurts the [...]

Original abstract

Large language models (LLMs) are increasingly used as rerankers in recommender systems, with the expectation that semantic understanding will help in cold-start and long-tail regimes. We test this assumption with a five-domain benchmark that explicitly separates reranking quality from retrieval coverage. In a positive-controlled regime where the gold item is guaranteed present, calibrated LLM rerankers fail to consistently outperform strong collaborative and content baselines under natural traffic, and within-family scaling from Qwen3-8B to Qwen3-32B narrows but does not close the gap on most domains. In a retrieval-realistic regime where the gold item is not injected, the bottleneck is more severe: standard single retrievers place the gold item in a 200-item pool only 4.6-22.9% of the time, largely because 32-91% of cold-start targets are brand-new items with no training interactions. We introduce LHF, a validation-trained learned hybrid fusion layer over a multi-retriever union pool, as a retrieval-side realizability baseline. LHF is the only combiner we test that beats every single retriever on all five domains and recovers 17-61% of oracle coverage headroom on content-rich domains, but only 5-7% on collaboratively strong domains. End-to-end experiments reveal the remaining mismatch: learned non-LLM ranking exploits the LHF pool, while prompt-level LLM reranking often degrades it. LLMs exhibit pockets of semantic cold-start advantage, especially in text-rich domains when the item is already present, but this advantage is largely unreachable in current retrieve-then-rerank pipelines. We release the benchmark protocol, splits, prompts, evaluation tooling, and archived reproducibility artifacts: data at https://doi.org/10.5281/zenodo.20991039 and code at https://doi.org/10.5281/zenodo.20993306.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Retrieval coverage is the real limiter in cold-start recs, and LLMs add little as rerankers even when the item is present.

The paper shows that standard retrievers miss the target item most of the time in cold-start settings, and that LLM rerankers do not beat strong baselines consistently even when the gold item is forced into the pool.

They separate the issues with two regimes across five domains. In the positive-controlled regime the gold item is injected, yet calibrated Qwen3 models still trail collaborative and content baselines on most domains, with scaling from 8B to 32B narrowing but not closing the gap. In the realistic regime, single retrievers place the target in a 200-item pool only 4.6-22.9% of the time, largely because 32-91% of cold targets have no training interactions at all. Their LHF learned hybrid fusion over a multi-retriever union is the only combiner that beats every single retriever on all five domains; it recovers 17-61% of oracle coverage headroom on content-rich domains but only 5-7% on collaboratively strong ones.
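For contrast with a learned combiner, a standard non-learned way to merge several retrievers is reciprocal rank fusion (RRF), which scores each item by the sum of 1/(k + rank) over the lists that contain it. This page does not say which combiners the paper compared LHF against, so the sketch below is only an illustration of what a simple fusion baseline looks like; the function name and the `top_n` parameter are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: Dict[str, List[str]],
                           k: int = 60, top_n: int = 200) -> List[str]:
    """Fuse ranked lists: score(item) = sum over retrievers of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranked in rankings.values():
        for rank, item in enumerate(ranked, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by descending score; break ties by item id for determinism.
    return sorted(scores, key=lambda it: (-scores[it], it))[:top_n]
```

RRF treats every retriever as equally trustworthy, which is one reason a fusion layer trained on validation data can behave differently when retrievers vary a lot in quality across domains.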

The work does two things well. It releases the full benchmark protocol, splits, prompts, and archived artifacts, which lets others verify the coverage numbers and LHF results. The patterns are reported as consistent across domains, which gives the retrieval-bottleneck claim some weight.

A soft spot is whether the benchmark cleanly isolates reranking quality. The share of brand-new items varies sharply by domain, and the abstract gives no explicit ablations on prompt engineering effort or calibration choices relative to the baselines. If those factors differ, the observed LLM underperformance could partly reflect setup rather than a general limit. The stress-test note flags this, and it looks like a real question to check.

This is for people building or evaluating retrieve-then-rerank pipelines in recommendation, especially anyone weighing LLMs for cold-start. The coverage statistics and LHF baseline are the parts that will travel.

It deserves peer review. The empirical separation and released materials are solid enough to warrant referee time even if the LLM conclusion needs tighter controls.

Referee Report

1 major / 0 minor

Summary. The manuscript presents a five-domain benchmark that separates positive-controlled (gold item guaranteed in pool) and retrieval-realistic regimes to diagnose bottlenecks in LLM reranking for cold-start recommendation. It reports that calibrated LLM rerankers do not consistently outperform collaborative and content baselines even when the target is present; in the realistic regime, single retrievers recover the gold item in a 200-item pool only 4.6-22.9% of the time, driven by 32-91% brand-new items with no training interactions. The authors introduce LHF, a validation-trained learned hybrid fusion over multi-retriever unions, which improves coverage over single retrievers on all domains and recovers 17-61% of oracle headroom on content-rich domains (5-7% on collaborative ones). End-to-end results show learned non-LLM ranking benefits from the LHF pool while prompt-based LLM reranking often degrades it, with limited semantic advantages for LLMs in text-rich domains when items are already retrieved. The work releases the benchmark protocol, splits, prompts, evaluation tooling, data, and code.

Significance. If the regime separation holds without domain-selection or prompt-optimization artifacts, the results establish retrieval coverage as the dominant bottleneck in cold-start LLM pipelines and show that LLM semantic advantages remain largely unreachable in standard retrieve-then-rerank setups. The explicit positive-controlled versus realistic comparison, the LHF baseline, and the released artifacts (data at doi:10.5281/zenodo.20991039, code at doi:10.5281/zenodo.20993306) constitute a concrete, reusable contribution that can standardize evaluation in this area.

major comments (1)

[Abstract] The central claim that the five-domain benchmark 'explicitly separates reranking quality from retrieval coverage' is load-bearing for the conclusion that LLM underperformance is not an artifact. However, the reported 32-91% brand-new item rates vary sharply by domain; without explicit ablations on domain-specific cold-start item selection or equalized prompt-engineering effort versus baselines, it remains possible that text-rich domains embed selection effects that favor semantic matching and thereby inflate the apparent retrieval bottleneck relative to collaborative domains.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. We address the single major comment below.

Referee: The central claim that the five-domain benchmark 'explicitly separates reranking quality from retrieval coverage' is load-bearing for the conclusion that LLM underperformance is not an artifact. However, the reported 32-91% brand-new item rates vary sharply by domain; without explicit ablations on domain-specific cold-start item selection or equalized prompt-engineering effort versus baselines, it remains possible that text-rich domains embed selection effects that favor semantic matching and thereby inflate the apparent retrieval bottleneck relative to collaborative domains.

Authors: The regime separation is implemented by construction within each domain: the positive-controlled regime injects the gold item into every candidate pool, enabling measurement of reranking quality conditional on presence, while the realistic regime uses unmodified retrieval output. This within-domain contrast holds irrespective of cross-domain variation in brand-new item rates (which we report transparently as a domain characteristic). The design therefore isolates the two factors without requiring additional domain-specific selection ablations. On prompting, we apply the same calibrated prompt templates and few-shot examples to all LLM rerankers across domains (see Section 4.2); equalizing optimization effort against non-LLM baselines would require an orthogonal experimental axis that falls outside the scope of evaluating standard retrieve-then-rerank pipelines. The observed LLM underperformance in the positive-controlled regime is therefore not an artifact of unequal tuning. We see no need to revise the manuscript on this point.

Revision: no
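The by-construction separation described in the rebuttal can be pictured with a small sketch. It assumes a pool is a ranked list cut at k items; the function name, the choice to keep the pool size fixed by dropping the lowest-ranked candidate, and the random insertion position are all illustrative assumptions, not details confirmed by this page.

```python
import random
from typing import List, Optional


def build_candidate_pool(retrieved: List[str], gold: str, k: int = 200,
                         inject_gold: bool = False,
                         rng: Optional[random.Random] = None) -> List[str]:
    """Return the top-k pool; with inject_gold=True, force the gold item in."""
    if k < 1:
        raise ValueError("k must be at least 1")
    pool = list(retrieved[:k])
    if inject_gold and gold not in pool:
        rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
        if len(pool) >= k:
            pool.pop()  # drop the lowest-ranked candidate to keep the size at k
        # Insert at a random position so the slot does not reveal the answer.
        pool.insert(rng.randrange(len(pool) + 1), gold)
    return pool
```

With `inject_gold=False` the function returns the unmodified retrieval output (the realistic regime); with `inject_gold=True` every pool contains the gold item, so any remaining error is attributable to the reranker rather than to coverage.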

Circularity Check

0 steps flagged

No circularity; purely empirical benchmark study

The paper reports experimental results from a five-domain benchmark comparing LLM rerankers against baselines in controlled and realistic retrieval regimes. All central claims (e.g., retrieval coverage rates of 4.6-22.9%, LHF recovering 17-61% headroom) are direct measurements or comparisons on held-out data, with no equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations. The derivation chain is absent; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The paper is an empirical study; no free parameters, mathematical axioms, or postulated physical entities are invoked beyond standard ML training assumptions. LHF is a methodological contribution rather than an invented entity with independent evidence.

invented entities (1)

LHF: no independent evidence
Purpose: learned hybrid fusion layer that combines multi-retriever pools for better cold-start coverage.
Introduced as a new baseline method; the abstract provides no external falsifiable evidence.

pith-pipeline@v0.9.1-grok · 5923 in / 1269 out tokens · 78221 ms · 2026-06-30T04:27:45.728478+00:00 · methodology

[1] [1]

Recommendation as language processing (rlp): A unified pretrain, personalized prompt and predict paradigm (p5)

Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. Recommendation as language processing (rlp): A unified pretrain, personalized prompt and predict paradigm (p5). InProceedings of the 16th ACM Conference on Recommender Systems, pages 299–315,

[2] [2]

doi: 10.1145/3523227.3546767

work page doi:10.1145/3523227.3546767

[3] [3]

Tallrec: An effective and efficient tuning framework to align large language model with recommendation

Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. Tallrec: An effective and efficient tuning framework to align large language model with recommendation. InProceedings of the 17th ACM Conference on Recommender Systems, pages 1007–1014, 2023. doi: 10.1145/3604915.3608857

work page doi:10.1145/3604915.3608857 2023

[4] [4]

Large language models are zero-shot rankers for recommender systems

Yupeng Hou, Junjie Zhang, Zihan Lin, Hongyu Lu, Ruobing Xie, Julian McAuley, and Wayne Xin Zhao. Large language models are zero-shot rankers for recommender systems. In Advances in Information Retrieval: 46th European Conference on Information Retrieval, pages 364–381, 2024

2024

[5] [5]

Recranker: Instruction tuning large language model as ranker for top-k recommendation.arXiv preprint arXiv:2312.16018, 2024

Sichun Luo, Bowei He, Haohan Zhao, Wei Shao, Yanlin Qi, Yinya Huang, Aojun Zhou, Yuxuan Yao, Zongpeng Li, Yuanzhang Xiao, Mingjie Zhan, and Linqi Song. Recranker: Instruction tuning large language model as ranker for top-k recommendation.arXiv preprint arXiv:2312.16018, 2024

arXiv 2024

[6] [6]

Is chatgpt a good recommender? a preliminary study.arXiv preprint arXiv:2304.10149, 2023

Junling Liu, Chao Liu, Peilin Zhou, Renjie Lv, Kang Zhou, and Yan Zhang. Is chatgpt a good recommender? a preliminary study.arXiv preprint arXiv:2304.10149, 2023

arXiv 2023

[7] [7]

A survey on large language models for recommendation.World Wide Web, 27:1–49, 2024

Likang Wu, Zhi Zheng, Zhaopeng Qiu, Hao Wang, Hongchao Gu, Tingjia Shen, Chuan Qin, Chen Zhu, Hengshu Zhu, Qi Liu, Hui Xiong, and Enhong Chen. A survey on large language models for recommendation.World Wide Web, 27:1–49, 2024. doi: 10.1007/s11280-024-01291-2

work page doi:10.1007/s11280-024-01291-2 2024

[8] [8]

How can recommender systems benefit from large language models: A survey.ACM Transactions on Information Systems, 43(2):28:1–28:47, 2025

Jianghao Lin, Xinyi Dai, Yunjia Xi, Weiwen Liu, Bo Chen, Hao Zhang, Yong Liu, Chuhan Wu, Xiangyang Li, Chenxu Zhu, Huifeng Guo, Yong Yu, Ruiming Tang, and Weinan Zhang. How can recommender systems benefit from large language models: A survey.ACM Transactions on Information Systems, 43(2):28:1–28:47, 2025. doi: 10.1145/3678004

work page doi:10.1145/3678004 2025

[9] [9]

Dropoutnet: Addressing cold start in recommender systems

Maksims Volkovs, Guang Wei Yu, and Tomi Poutanen. Dropoutnet: Addressing cold start in recommender systems. InAdvances in Neural Information Processing Systems, volume 30, 2017

2017

[10] [10]

Learning to warm up cold item embeddings for cold-start recommendation with 14 meta scaling and shifting networks

Yongchun Zhu, Ruobing Xie, Fuzhen Zhuang, Kaikai Ge, Ying Sun, Xu Zhang, Leyu Lin, and Juan Cao. Learning to warm up cold item embeddings for cold-start recommendation with 14 meta scaling and shifting networks. InProceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1167–1176, 2021. doi: 10...

work page doi:10.1145/3404835.3462843 2021

[11] [11]

Warm up cold-start advertisements: Improving ctr predictions via learning to learn id embeddings

Feiyang Pan, Shuokai Li, Xiang Ao, Pingzhong Tang, and Qing He. Warm up cold-start advertisements: Improving ctr predictions via learning to learn id embeddings. InProceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 695–704, 2019. doi: 10.1145/3331184.3331268

work page doi:10.1145/3331184.3331268 2019

[12] Hoyeop Lee, Jinbae Im, Seongwon Jang, Hyunsouk Cho, and Sehee Chung. Melu: Meta-learned user preference estimator for cold-start recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1073–1082, 2019. doi: 10.1145/3292500.3330859

[14] Yinwei Wei, Xiang Wang, Qi Li, Liqiang Nie, Yan Li, Xuanping Li, and Tat-Seng Chua. Contrastive learning for cold-start recommendation. In Proceedings of the 29th ACM International Conference on Multimedia, pages 5382–5390, 2021. doi: 10.1145/3474085.3475665

[15] Ruining He and Julian McAuley. Vbpr: Visual bayesian personalized ranking from implicit feedback. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 144–150, 2016

[16] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. Bpr: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 452–461, 2009

[17] Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. Neural graph collaborative filtering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 165–174, 2019. doi: 10.1145/3331184.3331267

[18] Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. Lightgcn: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 639–648, 2020. doi: 10.1145/3397271.3401063

[19] Wang-Cheng Kang and Julian McAuley. Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining, pages 197–206, 2018

[20] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 3982–3992, 2019. doi: 10.18653/v1/D19-1410

[21] Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie. C-pack: Packed resources for general chinese embeddings. arXiv preprint arXiv:2309.07597, 2023

[22] Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv preprint arXiv:2402.03216, 2024

[23] Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022

[24] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 6769–6781, 2020. doi: 10.18653/v1/2020.emnlp-main.550

[25] Xinyang Yi, Ji Yang, Lichan Hong, Derek Zhiyuan Cheng, Lukasz Heldt, Aditee Kumthekar, Zhe Zhao, Li Wei, and Ed H. Chi. Sampling-bias-corrected neural modeling for large corpus item recommendations. In Proceedings of the 13th ACM Conference on Recommender Systems, pages 269–277, 2019. doi: 10.1145/3298689.3346996

[26] Gordon V. Cormack, Charles L. A. Clarke, and Stefan Buettcher. Reciprocal rank fusion outperforms condorcet and individual rank learning methods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 758–759, 2009. doi: 10.1145/1571941.1572114

[27] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, volume 30, 2017

[28] Maurizio Ferrari Dacrema, Paolo Cremonesi, and Dietmar Jannach. Are we really making much progress? a worrying analysis of recent neural recommendation approaches. In Proceedings of the 13th ACM Conference on Recommender Systems, pages 101–109, 2019. doi: 10.1145/3298689.3347058

[29] Maurizio Ferrari Dacrema, Simone Boglio, Paolo Cremonesi, and Dietmar Jannach. A troubling analysis of reproducibility and progress in recommender systems research. ACM Transactions on Information Systems, 39(2):20:1–20:49, 2021. doi: 10.1145/3434185

[30] Rocio Canamares and Pablo Castells. On target item sampling in offline recommender system evaluation. In Proceedings of the 14th ACM Conference on Recommender Systems, pages 259–268, 2020. doi: 10.1145/3383313.3412259

[31] Ekaterina Lemdiasova and Nikita Zmanovskii. Diagnosing llm-based rerankers in cold-start recommender systems: Coverage, exposure and practical mitigations. arXiv preprint arXiv:2604.16318, 2026

[32] Yupeng Hou, Jiacheng Li, Zhankui He, An Yan, Xiusi Chen, and Julian McAuley. Bridging language and items for retrieval and recommendation. arXiv preprint arXiv:2403.03952, 2024

[33] McAuley Lab. Amazon Reviews 2023. https://amazon-reviews-2023.github.io/, 2023

[34] Fangzhao Wu, Ying Qiao, Jiun-Hung Chen, Chuhan Wu, Tao Qi, Jianxun Lian, Danyang Liu, Xing Xie, Jianfeng Gao, Winnie Wu, and Ming Zhou. Mind: A large-scale dataset for news recommendation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3597–3606, 2020. doi: 10.18653/v1/2020.acl-main.331

[35] F. Maxwell Harper and Joseph A. Konstan. The movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems, 5(4):19:1–19:19, 2015. doi: 10.1145/2827872

[36] Yelp Inc. Yelp Open Dataset. https://www.yelp.com/dataset, 2024

[37] An Yang et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[38] Meta AI. Llama 3.3 70B Instruct Model Card. https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct, 2024