Bridging Language and Items for Retrieval and Recommendation: Benchmarking LLMs as Semantic Encoders
Feature engineering has long been central to recommender systems, yet effectively leveraging textual item features remains challenging. Recent advances in large language models (LLMs) have enabled their use as semantic encoders for recommendation, but their roles and behaviors in this setting are still not well understood. Prior studies often rely on general-purpose embedding benchmarks (e.g., MTEB) when selecting LLMs, overlooking the unique characteristics of recommendation tasks. To address this gap, we introduce BLaIR, a comprehensive benchmark for evaluating LLMs as semantic encoders in recommendation scenarios. We contribute (1) a new large-scale Amazon Reviews 2023 dataset with over 570 million reviews and 48 million items, (2) a unified benchmark covering sequential recommendation, collaborative filtering, and product search, and (3) a new complex-query product search task featuring both semi-synthetic and real-world evaluation datasets. Experiments with 11 leading LLMs reveal that their rankings on BLaIR correlate only weakly with their rankings on MTEB, highlighting the unique challenges of semantic encoding in recommendation.
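The retrieval and search tasks described in the abstract ultimately reduce to ranking items by similarity between a query embedding and item embeddings. As a minimal sketch (pure Python with toy hand-made vectors standing in for LLM-produced embeddings; the function names are illustrative, not from the paper), cosine-similarity ranking could look like:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def rank_items(query_emb, item_embs):
    """Return item indices ordered from most to least similar to the query."""
    scores = [cosine(query_emb, e) for e in item_embs]
    return sorted(range(len(item_embs)), key=lambda i: -scores[i])

# Toy 2-d "embeddings"; in practice these would come from an LLM encoder.
items = [
    [1.0, 0.0],  # item 0: closely aligned with the query
    [0.0, 1.0],  # item 1: orthogonal to the query
    [0.7, 0.7],  # item 2: partially aligned
]
query = [1.0, 0.1]
print(rank_items(query, items))  # -> [0, 2, 1]
```

The benchmark's point is that which encoder produces the best such embeddings for recommendation data need not match which one tops a general-purpose leaderboard.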
Forward citations
Cited by 24 Pith papers
- Towards Robust Federated Multimodal Graph Learning under Modality Heterogeneity
  FedMPO recovers missing modalities via topology-aware generation, filters noisy recoveries with missing-aware routing, and uses reliability-aware aggregation to achieve up to 5.65% gains over baselines in high-missing...
- fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery
  fmxcoders improve cross-layer feature recovery in transformers via factorized weights and layer masking, delivering 10-30 point probing F1 gains, 25-50% lower MSE, doubled functional coherence, and 3-13x more coherent...
- FraudBench: A Multimodal Benchmark for Detecting AI-Generated Fraudulent Refund Evidence
  FraudBench shows that current multimodal LLMs and specialized AI-image detectors often fail to spot AI-generated fake damage in refund evidence, with true positive rates frequently below 50% on synthetic subsets while...
- The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs
  On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.
- Expressiveness Limits of Autoregressive Semantic ID Generation in Generative Recommendation
  Autoregressive semantic ID generation creates tree-induced probability correlations that prevent generative recommenders from capturing simple patterns; Latte adds latent tokens to relax these correlations.
- One Pass, Any Order: Position-Invariant Listwise Reranking for LLM-Based Recommendation
  InvariRank achieves permutation-invariant listwise reranking for LLM-based recommendations via a structured attention mask that blocks cross-candidate interactions and shared positional framing under RoPE, enabling st...
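InvariRank's exact mask construction is not reproduced in this summary. As an illustration of the general idea only (all names and layout here are my own, not the paper's), a mask that lets every token attend to a shared prompt prefix while isolating each candidate's tokens from every other candidate could be built like this:

```python
def block_diagonal_mask(prompt_len, cand_lens):
    """Attention mask as a nested list of booleans; True = attention allowed.

    Every token may attend to the shared prompt prefix. A candidate's
    tokens may additionally attend only to tokens of the *same* candidate,
    so no candidate observes another and scores do not depend on the
    order in which candidates are listed.
    """
    total = prompt_len + sum(cand_lens)
    # Candidate id per position (-1 marks prompt tokens).
    owner = [-1] * prompt_len
    for cid, n in enumerate(cand_lens):
        owner += [cid] * n
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if owner[j] == -1:            # prompt is visible to everyone
                mask[i][j] = True
            elif owner[i] == owner[j]:    # same candidate's tokens see each other
                mask[i][j] = True
    return mask

# 2 prompt tokens, then two candidates of 2 tokens each.
mask = block_diagonal_mask(2, [2, 2])
```

A real implementation would also intersect this with a causal mask and handle the positional framing; this sketch only shows the cross-candidate blocking.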
- Breaking the Autoregressive Chain: Hyper-Parallel Decoding for Efficient LLM-Based Attribute Value Extraction
  Hyper-Parallel Decoding enables parallel generation of independent sequences in LLMs via position ID manipulation, delivering up to 13.8X speedup for attribute value extraction.
- HORIZON: A Benchmark for In-the-wild User Behaviour Modeling
  HORIZON creates a cross-domain, long-horizon user modeling benchmark from Amazon Reviews that tests generalization across time, domains, and unseen users, exposing gaps in sequential and LLM-based recommendation models.
- DynLP: Parallel Dynamic Batch Update for Label Propagation in Semi-Supervised Learning
  DynLP is a parallel dynamic batch update algorithm for label propagation that achieves significant speedups by updating only relevant parts of the graph on GPUs.
- Task-Aware Automated User Profile Generation for Recommendation Simulation Using Large Language Models
  APG4RecSim automatically generates realistic user profiles for LLM-based recommendation simulations, outperforming manual baselines by up to 7% in nDCG@10 and 8% in JSD on three benchmark datasets.
- CAMPA: Efficient and Aligned Multimodal Graph Learning via Decoupled Propagation and Aggregation
  CAMPA resolves modal conflicts in decoupled multimodal GNNs via cross-modal aligned propagation and trajectory aligned aggregation, outperforming coupled and decoupled baselines on benchmarks while retaining efficiency.
- LLM Agents Enable User-Governed Personalization Beyond Platform Boundaries
  LLM agents enable users to integrate cross-platform and offline data for personalization that outperforms single-platform baselines in proof-of-concept tests.
- Bridging Textual Profiles and Latent User Embeddings for Personalization
  BLUE aligns LLM-generated textual user profiles with embedding-based recommendation objectives via reinforcement learning and next-item text supervision, yielding better zero-shot performance and cross-domain transfer...
- PREFER: Personalized Review Summarization with Online Preference Learning
  PREFER is an online preference learning system that generates personalized review summaries and improves alignment with user interests in simulations on Amazon review data.
- One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving
  HELM adaptively partitions HBM between EMB and KV caches via a three-layer PPO controller and EMB-KV-aware scheduling, reducing P99 latency by 24-38% while achieving 93.5-99.6% SLO satisfaction on production workloads.
- Decision-aware User Simulation Agent for Evaluating Conversational Recommender Systems
  Hesitator is a theory-grounded simulator that separates utility-based item selection from overload-aware commitment decisions to reduce unrealistic high acceptance rates in conversational recommender evaluations.
- From Top-1 to Top-K: A Reproducibility Study and Benchmarking of Counterfactual Explanations for Recommender Systems
  A unified benchmark of eleven CE methods shows effectiveness-sparsity trade-offs vary by method and format, performance is consistent from item to list level, and graph-based explainers face scalability limits on larg...
- Self-Distilled Reinforcement Learning for Co-Evolving Agentic Recommender Systems
  CoARS enables co-evolving recommender and user agents by using interaction-derived rewards and self-distilled credit assignment to internalize multi-turn feedback into model parameters, outperforming prior agentic baselines.
- PeReGrINE: Evaluating Personalized Review Fidelity with User Item Graph Context
  PeReGrINE is a graph-based benchmark that restructures Amazon Reviews 2023 with temporal cutoffs and introduces dissonance analysis to measure how well retrieval-conditioned models match user style and product consensus.
- TRU: Targeted Reverse Update for Efficient Multimodal Recommendation Unlearning
  TRU is a plug-and-play unlearning method for multimodal recommenders that applies ranking fusion, modality scaling, and layer isolation to achieve better retain-forget trade-offs than uniform baselines.
- RcLLM: Accelerating Generative Recommendation via Beyond-Prefix KV Caching
  RcLLM accelerates generative recommendation inference by 1.31x-9.51x in TTFT through beyond-prefix KV caching, replicated user caches, sharded item caches, affinity scheduling, and selective attention with negligible ...
- Stable Multimodal Graph Unlearning via Feature-Dimension Aware Quantile Selection
  FDQ improves stability in multimodal graph unlearning by using feature-dimension aware quantile selection to protect sensitive high-dimensional layers while preserving utility and enabling effective forgetting.
- Rethinking Semantic Collaborative Integration: Why Alignment Is Not Enough
  Semantic and collaborative representations show low item-level overlap on sparse data, so global alignment suppresses complementary signals and a shared-plus-private fusion design is needed instead.
- Multistakeholder Impacts of Profile Portability in a Recommender Ecosystem
  Data portability scenarios under algorithmic pluralism produce varying effects on user utility across different recommendation algorithms.