arxiv: 2605.18805 · v1 · pith:27R2OM7Anew · submitted 2026-05-11 · 💻 cs.IR · cs.AI · cs.LG

RecoAtlas: From Semantic Plausibility to Set-Level Utility in LLM Recommendation Agents

Pith reviewed 2026-05-20 22:22 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.LG
keywords LLM recommendation agents · behavior-grounded evaluation · semantic plausibility · set-level utility · shopping recommendation benchmark · agentic tool use · relevance complementarity diversity · interaction-derived metrics

The pith

RecoAtlas benchmark shows that semantic plausibility in LLM shopping reports does not capture behavior-grounded set utility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RecoAtlas as a benchmark and toolkit for evaluating LLM recommendation agents that output sets of items with natural-language justifications. It argues that existing approaches relying on reranking small candidate lists or judging semantic coherence fall short because they do not check whether the sets deliver actual relevance, complementarity, and diversity according to user interaction patterns. RecoAtlas supplies learned utility proxies drawn from held-out interaction data and places agents in a controlled environment that can supply semantic, behavior-aligned, or faulty tools. Controlled experiments demonstrate that agent performance grows with model capacity and test-time compute, rises when tools are stronger and better aligned, falls when signals are noisy, and that high semantic scores do not guarantee high behavior-grounded utility.

Core claim

RecoAtlas evaluates shopping agents with held-out interaction metrics plus learned proxies for relevance, complementarity, and diversity derived from interaction data, while separately scoring semantic coherence and explanation quality. Its controlled tool environment lets agents face semantic, behavior-aligned, or faulty signals so that gains can be attributed to reasoning, signal quality, or tool-use policy. Experiments confirm that performance scales with model capacity and test-time compute, improves with stronger aligned tools, degrades under misalignment, and that semantic plausibility does not necessarily reflect the behavior-grounded utility of the resulting recommendation set.

What carries the argument

The controlled tool environment, which gives agents semantic, behavior-aligned, or faulty tools, combined with separate scoring of semantic coherence and of the learned utility proxies for relevance, complementarity, and diversity derived from interaction data.

If this is right

  • Agent performance increases with greater model capacity and additional test-time compute.
  • Performance rises when agents receive stronger and better-aligned tools.
  • Performance falls when agents receive noisy or misaligned signals.
  • Semantic plausibility of reports does not necessarily indicate set-level behavior-grounded utility.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future agent training loops could incorporate the behavior-derived utility proxies directly rather than relying only on semantic feedback.
  • The separation of semantic and utility signals offers a template for evaluating generative agents in other domains where plausible text must be checked against measurable outcomes.
  • Benchmarks for agentic systems may need to include explicit tool-quality controls to isolate the contribution of reasoning from the quality of external signals.

Load-bearing premise

The learned utility proxies for relevance, complementarity, and diversity derived from interaction data accurately reflect real user behavior and preferences in the target shopping domain.

What would settle it

A live A/B test in which users receive recommendation sets chosen to maximize RecoAtlas utility scores versus sets chosen to maximize semantic plausibility scores, with the outcome measured by click-through or conversion rates, would settle whether the distinction holds.
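
One way the outcome of such a test could be scored, as a minimal sketch: a two-sided two-proportion z-test on conversion counts from the two arms. The function name and any counts passed to it are hypothetical illustrations, not data from the paper, and a real experiment would also need a sample size fixed in advance.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates between two arms.

    conv_a / n_a: conversions and users in arm A (e.g. utility-maximizing sets).
    conv_b / n_b: conversions and users in arm B (e.g. plausibility-maximizing sets).
    Returns (z, p_value) under the pooled-variance normal approximation.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both arms converted at 0% or 100%: no detectable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail mass
    return z, p_value
```

A small p-value with a positive z would favor the utility-maximizing arm; a null result would leave open whether the proxies track behavior any better than semantic scores do.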

Figures

Figures reproduced from arXiv: 2605.18805 by Alexandre Gilotte, Benjamin Heymann, Flavian Vasile, Imad Aouali, Otmane Sakhi.

Figure 1. System overview of RecoAtlas. A query-generation module creates comparative-shopping and bundle-shopping tasks with held-out ground-truth items. An agent searches the catalog through recommendation-specific tools for relevance, complementarity, and substitutability, then produces a fixed-size report. The report-level evaluator scores the final set using exact recovery (SetHit@K), learned reward models for …
Figure 2. Alignment between LLM-judge scoring and SetHit@20. The agent uses GPT-4.1-mini. In this experiment, we separate evaluation into two metrics. The first family consists of LLM judges, which score the final report along axes such as relevance, complementarity, diversity, and explanation quality. The second is the SetHit@20 that measures whether the agent’s returned set recovers held-out behaviorally groun…
Figure 3. Effect of the number of tools on SetHit@20. The agent uses GPT-4.1-mini
Figure 4. Model-size and test-time scaling. The agent uses
Figure 5. Main leaderboard under the utility-aligned tool setting.
Figure 6. Reward-model decomposition for representative
Figure 7. SetHit@20 retention at a 50% faulty-tool rate
Figure 8. Effect of increasing faulty-tool rate on SetHit@20
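
The captions name SetHit@K as the exact-recovery metric but the excerpt cuts off before defining it. The sketch below shows one plausible reading, in which a task counts as a hit when at least one held-out item appears among the first K returned item IDs. That reading, the function names, and the input layout are assumptions for illustration; the paper may instead require full or fractional recovery.

```python
def set_hit_at_k(returned_ids, held_out_ids, k=20):
    """Return 1.0 if any held-out item is among the first k returned IDs, else 0.0.

    Assumption: a 'hit' is at least one exact ID match. This is one possible
    reading of SetHit@K, not the paper's confirmed definition.
    """
    top_k = set(returned_ids[:k])
    return 1.0 if top_k & set(held_out_ids) else 0.0

def mean_set_hit(reports, k=20):
    """Average SetHit@k over (returned_ids, held_out_ids) pairs, one per task."""
    if not reports:
        return 0.0
    return sum(set_hit_at_k(r, h, k) for r, h in reports) / len(reports)
```

Because the metric depends only on item IDs, a report can read as fully plausible to an LLM judge and still score zero here, which is the gap Figure 2 is concerned with.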
Original abstract

LLM recommendation agents increasingly produce structured recommendation reports: sets of items accompanied by natural-language justifications. Yet existing evaluations often reduce this setting to reranking small shortlisted candidate sets or judge reports mainly by semantic plausibility. We introduce Recommendation Atlas (Agentic Tool-Level Assessment for Shopping), or RecoAtlas, a benchmark and toolkit for evaluating shopping agents with behavior-grounded metrics. RecoAtlas complements held-out interaction metrics with learned utility proxies for relevance, complementarity, and diversity derived from interaction data, while separately measuring semantic coherence and explanation quality. Its controlled tool environment exposes agents to either semantic, behavior-aligned, or faulty tools, enabling diagnosis of whether performance gains arise from stronger reasoning, better signals, or more effective tool-use policies. Across controlled experiments, we show that RecoAtlas exhibits key properties of a meaningful benchmark for agentic systems: performance scales with model capacity and test-time compute, improves with stronger and better-aligned tools, degrades under noisy or misaligned signals, and reveals that semantic plausibility does not necessarily capture behavior-grounded utility. RecoAtlas provides a foundation for developing and evaluating shopping assistants that optimize not only for plausible recommendations, but also for coherent, behaviorally grounded recommendation sets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces RecoAtlas, a benchmark and toolkit for evaluating LLM-based shopping recommendation agents. It complements held-out interaction metrics with learned utility proxies for relevance, complementarity, and diversity derived from interaction data, while also measuring semantic coherence and explanation quality. Controlled experiments vary model capacity, test-time compute, tool alignment, and signal noise to show that performance scales with capacity and compute, improves with stronger aligned tools, degrades under faulty signals, and that semantic plausibility does not necessarily reflect behavior-grounded utility.

Significance. If the learned proxies are independently validated against held-out user behaviors without leakage, RecoAtlas would offer a useful diagnostic framework for agentic recommendation systems, clarifying when gains come from reasoning versus tool signals and separating semantic from utility evaluation. The scaling and tool-alignment results could inform development of shopping assistants that optimize for coherent, behaviorally grounded sets.

major comments (3)
  1. [Abstract] Abstract and utility-proxy construction: the claim that proxies capture 'behavior-grounded utility' distinct from semantic plausibility rests on their derivation from interaction data, yet no details are given on the training objective, feature set, or explicit correlation checks (e.g., against purchase rates or click sequences withheld from proxy fitting). This is load-bearing for the dissociation result and for interpreting the scaling/tool-alignment experiments.
  2. [Experimental setup] Data-split and leakage section: the manuscript states that held-out interaction metrics are used alongside the proxies, but does not specify whether the interaction data used to fit the relevance/complementarity/diversity proxies overlaps with the held-out sets or shares the same user-item distribution. Without an independent validation split or reported correlation to external behavioral signals, the separation between semantic and utility signals risks being artifactual.
  3. [Results] Table or figure reporting proxy validation: no quantitative evidence (e.g., proxy-to-held-out correlation coefficients or ablation on proxy training data) is referenced to confirm that the learned proxies track real user behavior rather than overfitting to the training interactions. This directly affects the interpretability of the 'semantic plausibility does not capture utility' finding.
minor comments (2)
  1. [Method] Notation for the three utility proxies (relevance, complementarity, diversity) should be defined once with explicit formulas or pseudocode rather than described only in prose.
  2. [Figures] Figure captions for the scaling and tool-alignment plots should include error bars or statistical significance markers to support the reported trends.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments correctly identify areas where additional clarity on proxy construction, data handling, and validation would strengthen the interpretability of RecoAtlas. We address each major comment below and will revise the manuscript accordingly to incorporate the requested details without altering the core claims or experimental design.

Point-by-point responses
  1. Referee: [Abstract] Abstract and utility-proxy construction: the claim that proxies capture 'behavior-grounded utility' distinct from semantic plausibility rests on their derivation from interaction data, yet no details are given on the training objective, feature set, or explicit correlation checks (e.g., against purchase rates or click sequences withheld from proxy fitting). This is load-bearing for the dissociation result and for interpreting the scaling/tool-alignment experiments.

    Authors: The proxies are derived from interaction data as summarized in the methods, using models trained to predict observed user behaviors such as purchases and co-occurrences. We agree that the manuscript would benefit from expanded description of the precise training objective, input features (e.g., item metadata and sequence statistics), and quantitative checks against withheld signals. In the revision we will add these specifics to the proxy construction subsection and include correlation results with held-out purchase indicators to support the dissociation between semantic plausibility and behavior-grounded utility. revision: yes

  2. Referee: [Experimental setup] Data-split and leakage section: the manuscript states that held-out interaction metrics are used alongside the proxies, but does not specify whether the interaction data used to fit the relevance/complementarity/diversity proxies overlaps with the held-out sets or shares the same user-item distribution. Without an independent validation split or reported correlation to external behavioral signals, the separation between semantic and utility signals risks being artifactual.

    Authors: The experimental setup employs a temporal split in which proxy training occurs on earlier interactions while held-out metrics are computed on later periods, ensuring instance-level disjointness. We acknowledge that the current text does not explicitly rule out distributional overlap or describe an independent validation split for the proxies themselves. The revision will include a dedicated paragraph and diagram clarifying the split ratios, confirming that proxy fitting and held-out evaluation use non-overlapping user-item instances drawn from the same overall distribution, and noting any available external behavioral correlations. revision: yes

  3. Referee: [Results] Table or figure reporting proxy validation: no quantitative evidence (e.g., proxy-to-held-out correlation coefficients or ablation on proxy training data) is referenced to confirm that the learned proxies track real user behavior rather than overfitting to the training interactions. This directly affects the interpretability of the 'semantic plausibility does not capture utility' finding.

    Authors: The dissociation result is currently supported by the divergence observed across controlled conditions in the main experiments. We agree that direct quantitative validation evidence would improve rigor and address potential overfitting concerns. The revised manuscript will add a new table (or subsection) reporting proxy-to-held-out correlation coefficients together with an ablation on training data volume to demonstrate that the proxies track real user behavior. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper introduces RecoAtlas as a benchmark that pairs held-out interaction metrics with separately learned utility proxies for relevance, complementarity, and diversity, while measuring semantic coherence independently. No equations, self-citations, or ansatzes are quoted that reduce the central claims (scaling with model capacity, tool alignment effects, or semantic-vs-utility separation) to fitted inputs or prior author work by construction. The described evaluation structure treats the proxies as derived from interaction data but complements them with held-out sets and controlled tool environments, keeping the diagnostic claims externally falsifiable rather than tautological. This is the most common honest outcome for a benchmark paper whose core contribution is an experimental toolkit rather than a closed-form derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the utility proxies are described as learned from data but without derivation details.

pith-pipeline@v0.9.0 · 5763 in / 1023 out tokens · 45849 ms · 2026-05-20T22:22:57.450347+00:00 · methodology

Appendix A. Benchmark artifact build

RecoAtlas builds one category-specific environment per Amazon product category. Each environment consists of a filtered catalog I, product texts x_i for every item i ∈ I, subcategory metada...

In the experiments, we use subcategory-adversarial corruption. For search_products and get_complementary_products, let K be the number of clean returned slots and ρ ∈ [0,1] the faulty-tool rate. The corruption selects round(Kρ) slots uniformly without replacement. Each corrupted slot keeps its displayed score, but its item ID, title, and description are rep...
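
The slot-selection step described above can be sketched as follows. The excerpt is cut off before it says where replacement items come from, so this sketch takes a caller-supplied `decoys` pool. That pool, the function name, and the dict layout of a tool result are hypothetical, and the code does not attempt to reproduce the subcategory-adversarial choice of replacements.

```python
import random

def corrupt_tool_results(results, decoys, rho, rng=None):
    """Replace round(K * rho) uniformly chosen slots with decoy items, keeping scores.

    results: list of dicts with keys 'item_id', 'title', 'description', 'score'.
    decoys:  list of dicts with 'item_id', 'title', 'description' (hypothetical pool;
             the paper's replacement rule is truncated in the source).
    rho:     faulty-tool rate in [0, 1].
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be in [0, 1]")
    rng = rng or random.Random()
    k = len(results)
    # Python's round() sends exact halves to the nearest even integer;
    # the source does not say which rounding rule the paper uses.
    n_corrupt = round(k * rho)
    if n_corrupt > len(decoys):
        raise ValueError("not enough decoys for the requested faulty-tool rate")
    slots = rng.sample(range(k), n_corrupt)       # uniform, without replacement
    replacements = rng.sample(decoys, n_corrupt)  # distinct decoys per slot
    corrupted = [dict(r) for r in results]        # copy so the clean list is untouched
    for slot, decoy in zip(slots, replacements):
        corrupted[slot].update(
            item_id=decoy["item_id"],
            title=decoy["title"],
            description=decoy["description"],
        )  # 'score' is deliberately left as displayed
    return corrupted
```

Keeping the displayed score while swapping the item means an agent that trusts scores alone cannot detect the fault; it has to read the title and description against the query.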