RecoAtlas: From Semantic Plausibility to Set-Level Utility in LLM Recommendation Agents

Alexandre Gilotte; Benjamin Heymann; Flavian Vasile; Imad Aouali; Otmane Sakhi

arxiv: 2605.18805 · v1 · pith:27R2OM7Anew · submitted 2026-05-11 · 💻 cs.IR · cs.AI· cs.LG

RecoAtlas: From Semantic Plausibility to Set-Level Utility in LLM Recommendation Agents

Imad Aouali , Flavian Vasile , Otmane Sakhi , Alexandre Gilotte , Benjamin Heymann This is my paper

Pith reviewed 2026-05-20 22:22 UTC · model grok-4.3

classification 💻 cs.IR cs.AIcs.LG

keywords LLM recommendation agentsbehavior-grounded evaluationsemantic plausibilityset-level utilityshopping recommendation benchmarkagentic tool userelevance complementarity diversityinteraction-derived metrics

0 comments

The pith

RecoAtlas benchmark shows that semantic plausibility in LLM shopping reports does not capture behavior-grounded set utility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RecoAtlas as a benchmark and toolkit for evaluating LLM recommendation agents that output sets of items with natural-language justifications. It argues that existing approaches relying on reranking small candidate lists or judging semantic coherence fall short because they do not check whether the sets deliver actual relevance, complementarity, and diversity according to user interaction patterns. RecoAtlas supplies learned utility proxies drawn from held-out interaction data and places agents in a controlled environment that can supply semantic, behavior-aligned, or faulty tools. Controlled experiments demonstrate that agent performance grows with model capacity and test-time compute, rises when tools are stronger and better aligned, falls when signals are noisy, and that high semantic scores do not guarantee high behavior-grounded utility.

Core claim

RecoAtlas equips evaluation of shopping agents with held-out interaction metrics plus learned proxies for relevance, complementarity, and diversity derived from interaction data, while separately scoring semantic coherence and explanation quality. Its controlled tool environment lets agents face semantic, behavior-aligned, or faulty signals so that gains can be attributed to reasoning, signal quality, or tool-use policy. Experiments confirm that performance scales with model capacity and test-time compute, improves with stronger aligned tools, degrades under misalignment, and that semantic plausibility does not necessarily reflect behavior-grounded utility of the resulting recommendation set

What carries the argument

The controlled tool environment that supplies agents with either semantic, behavior-aligned, or faulty tools while scoring both semantic coherence and learned utility proxies for relevance, complementarity, and diversity derived from interaction data.

If this is right

Agent performance increases with greater model capacity and additional test-time compute.
Performance rises when agents receive stronger and better-aligned tools.
Performance falls when agents receive noisy or misaligned signals.
Semantic plausibility of reports does not necessarily indicate set-level behavior-grounded utility.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future agent training loops could incorporate the behavior-derived utility proxies directly rather than relying only on semantic feedback.
The separation of semantic and utility signals offers a template for evaluating generative agents in other domains where plausible text must be checked against measurable outcomes.
Benchmarks for agentic systems may need to include explicit tool-quality controls to isolate the contribution of reasoning from the quality of external signals.

Load-bearing premise

The learned utility proxies for relevance, complementarity, and diversity derived from interaction data accurately reflect real user behavior and preferences in the target shopping domain.

What would settle it

A live A/B test in which users receive recommendation sets chosen to maximize RecoAtlas utility scores versus sets chosen to maximize semantic plausibility scores, with the outcome measured by click-through or conversion rates, would settle whether the distinction holds.

Figures

Figures reproduced from arXiv: 2605.18805 by Alexandre Gilotte, Benjamin Heymann, Flavian Vasile, Imad Aouali, Otmane Sakhi.

**Figure 1.** Figure 1: System overview of RecoAtlas. A query-generation module creates comparative-shopping and bundle-shopping tasks with held-out ground-truth items. An agent searches the catalog through recommendation-specific tools for relevance, complementarity, and substitutability, then produces a fixed-size report. The report-level evaluator scores the final set using exact recovery (SetHit@K), learned reward models for … view at source ↗

**Figure 2.** Figure 2: Alignment between LLM-judge scoring and SetHit@20. The agent uses GPT-4.1-mini. In this experiment, we separate evaluation into two metrics. The first family consists of LLM judges, which score the final report along axes such as relevance, complementarity, diversity, and explanation quality. The second is the SetHit@20 that measures whether the agent’s returned set recovers held-out behaviorally groun… view at source ↗

**Figure 3.** Figure 3: Effect of the number of tools on SetHit@20. The agent uses GPT-4.1-mini [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Model-size and test-time scaling. The agent uses [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Main leaderboard under the utility-aligned tool setting. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Reward-model decomposition for representative [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: SetHit@20 retention at a 50% faulty-tool rate [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: Effect of increasing faulty-tool rate on SetHit@20 [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

read the original abstract

LLM recommendation agents increasingly produce structured recommendation reports: sets of items accompanied by natural-language justifications. Yet existing evaluations often reduce this setting to reranking small shortlisted candidate sets or judge reports mainly by semantic plausibility. We introduce Recommendation Atlas (Agentic Tool-Level Assessment for Shopping), or RecoAtlas, a benchmark and toolkit for evaluating shopping agents with behavior-grounded metrics. RecoAtlas complements held-out interaction metrics with learned utility proxies for relevance, complementarity, and diversity derived from interaction data, while separately measuring semantic coherence and explanation quality. Its controlled tool environment exposes agents to either semantic, behavior-aligned, or faulty tools, enabling diagnosis of whether performance gains arise from stronger reasoning, better signals, or more effective tool-use policies. Across controlled experiments, we show that RecoAtlas exhibits key properties of a meaningful benchmark for agentic systems: performance scales with model capacity and test-time compute, improves with stronger and better-aligned tools, degrades under noisy or misaligned signals, and reveals that semantic plausibility does not necessarily capture behavior-grounded utility. RecoAtlas provides a foundation for developing and evaluating shopping assistants that optimize not only for plausible recommendations, but also for coherent, behaviorally grounded recommendation sets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RecoAtlas sets up a diagnostic benchmark for LLM shopping agents but its utility proxies require stronger independent validation.

read the letter

RecoAtlas sets up a benchmark for LLM recommendation agents that tries to measure behavior-grounded utility for recommendation sets instead of relying mainly on semantic plausibility. The new element is the learned proxies for relevance, complementarity, and diversity drawn from interaction data, together with a tool environment that tests agents under semantic, aligned, or faulty conditions. The experiments demonstrate scaling with model capacity and test-time compute, gains from better tools, and drops with noisy signals. They also highlight cases where semantic plausibility fails to capture the utility aspects. This approach is solid for diagnosing sources of performance in agentic recommendation systems. It gives a way to check if improvements stem from better reasoning or from the quality of the signals provided. The main soft spot is in the construction and validation of those utility proxies. The paper needs to show clearly how data splits prevent leakage and whether the proxies correlate with real held-out behaviors like purchases. Without that, the distinction between semantic and utility signals risks being less meaningful than presented. This paper targets researchers in recommendation systems and LLM agents, particularly those focused on e-commerce applications. A reader working on evaluation methods for structured outputs from agents would get practical ideas from the controlled tests. The work shows honest engagement with the evaluation challenges in this area. It deserves a serious referee because the benchmark framework and diagnostic experiments are worth detailed scrutiny. I would recommend sending it to peer review.

Referee Report

3 major / 2 minor

Summary. The paper introduces RecoAtlas, a benchmark and toolkit for evaluating LLM-based shopping recommendation agents. It complements held-out interaction metrics with learned utility proxies for relevance, complementarity, and diversity derived from interaction data, while also measuring semantic coherence and explanation quality. Controlled experiments vary model capacity, test-time compute, tool alignment, and signal noise to show that performance scales with capacity and compute, improves with stronger aligned tools, degrades under faulty signals, and that semantic plausibility does not necessarily reflect behavior-grounded utility.

Significance. If the learned proxies are independently validated against held-out user behaviors without leakage, RecoAtlas would offer a useful diagnostic framework for agentic recommendation systems, clarifying when gains come from reasoning versus tool signals and separating semantic from utility evaluation. The scaling and tool-alignment results could inform development of shopping assistants that optimize for coherent, behaviorally grounded sets.

major comments (3)

[Abstract] Abstract and utility-proxy construction: the claim that proxies capture 'behavior-grounded utility' distinct from semantic plausibility rests on their derivation from interaction data, yet no details are given on the training objective, feature set, or explicit correlation checks (e.g., against purchase rates or click sequences withheld from proxy fitting). This is load-bearing for the dissociation result and for interpreting the scaling/tool-alignment experiments.
[Experimental setup] Data-split and leakage section: the manuscript states that held-out interaction metrics are used alongside the proxies, but does not specify whether the interaction data used to fit the relevance/complementarity/diversity proxies overlaps with the held-out sets or shares the same user-item distribution. Without an independent validation split or reported correlation to external behavioral signals, the separation between semantic and utility signals risks being artifactual.
[Results] Table or figure reporting proxy validation: no quantitative evidence (e.g., proxy-to-held-out correlation coefficients or ablation on proxy training data) is referenced to confirm that the learned proxies track real user behavior rather than overfitting to the training interactions. This directly affects the interpretability of the 'semantic plausibility does not capture utility' finding.

minor comments (2)

[Method] Notation for the three utility proxies (relevance, complementarity, diversity) should be defined once with explicit formulas or pseudocode rather than described only in prose.
[Figures] Figure captions for the scaling and tool-alignment plots should include error bars or statistical significance markers to support the reported trends.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments correctly identify areas where additional clarity on proxy construction, data handling, and validation would strengthen the interpretability of RecoAtlas. We address each major comment below and will revise the manuscript accordingly to incorporate the requested details without altering the core claims or experimental design.

read point-by-point responses

Referee: [Abstract] Abstract and utility-proxy construction: the claim that proxies capture 'behavior-grounded utility' distinct from semantic plausibility rests on their derivation from interaction data, yet no details are given on the training objective, feature set, or explicit correlation checks (e.g., against purchase rates or click sequences withheld from proxy fitting). This is load-bearing for the dissociation result and for interpreting the scaling/tool-alignment experiments.

Authors: The proxies are derived from interaction data as summarized in the methods, using models trained to predict observed user behaviors such as purchases and co-occurrences. We agree that the manuscript would benefit from expanded description of the precise training objective, input features (e.g., item metadata and sequence statistics), and quantitative checks against withheld signals. In the revision we will add these specifics to the proxy construction subsection and include correlation results with held-out purchase indicators to support the dissociation between semantic plausibility and behavior-grounded utility. revision: yes
Referee: [Experimental setup] Data-split and leakage section: the manuscript states that held-out interaction metrics are used alongside the proxies, but does not specify whether the interaction data used to fit the relevance/complementarity/diversity proxies overlaps with the held-out sets or shares the same user-item distribution. Without an independent validation split or reported correlation to external behavioral signals, the separation between semantic and utility signals risks being artifactual.

Authors: The experimental setup employs a temporal split in which proxy training occurs on earlier interactions while held-out metrics are computed on later periods, ensuring instance-level disjointness. We acknowledge that the current text does not explicitly rule out distributional overlap or describe an independent validation split for the proxies themselves. The revision will include a dedicated paragraph and diagram clarifying the split ratios, confirming that proxy fitting and held-out evaluation use non-overlapping user-item instances drawn from the same overall distribution, and noting any available external behavioral correlations. revision: yes
Referee: [Results] Table or figure reporting proxy validation: no quantitative evidence (e.g., proxy-to-held-out correlation coefficients or ablation on proxy training data) is referenced to confirm that the learned proxies track real user behavior rather than overfitting to the training interactions. This directly affects the interpretability of the 'semantic plausibility does not capture utility' finding.

Authors: The dissociation result is currently supported by the divergence observed across controlled conditions in the main experiments. We agree that direct quantitative validation evidence would improve rigor and address potential overfitting concerns. The revised manuscript will add a new table (or subsection) reporting proxy-to-held-out correlation coefficients together with an ablation on training data volume to demonstrate that the proxies track real user behavior. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces RecoAtlas as a benchmark that pairs held-out interaction metrics with separately learned utility proxies for relevance, complementarity, and diversity, while measuring semantic coherence independently. No equations, self-citations, or ansatzes are quoted that reduce the central claims (scaling with model capacity, tool alignment effects, or semantic-vs-utility separation) to fitted inputs or prior author work by construction. The described evaluation structure treats the proxies as derived from interaction data but complements them with held-out sets and controlled tool environments, keeping the diagnostic claims externally falsifiable rather than tautological. This is the most common honest outcome for a benchmark paper whose core contribution is an experimental toolkit rather than a closed-form derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the utility proxies are described as learned from data but without derivation details.

pith-pipeline@v0.9.0 · 5763 in / 1023 out tokens · 45849 ms · 2026-05-20T22:22:57.450347+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

learned utility proxies for relevance, complementarity, and diversity derived from interaction data
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

performance scales with model capacity and test-time compute, improves with stronger and better-aligned tools

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 7 internal anchors

[1]

Agentrecbench: Benchmarking llm agent-based personalized recommender systems.arXiv preprint arXiv:2505.19623, 2025

Yu Shang, Peijie Liu, Yuwei Yan, Zijing Wu, Leheng Sheng, Yuanqing Yu, Chumeng Jiang, An Zhang, Fengli Xu, Yu Wang, Min Zhang, and Yong Li. Agentrecbench: Benchmarking llm agent-based personalized recommender systems.arXiv preprint arXiv:2505.19623, 2025

work page arXiv 2025
[2]

AgentBench: Evaluating LLMs as Agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating LLMs as agents.arXiv preprint arXiv:2308.03688, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

GAIA: a benchmark for General AI Assistants

Grégoire Mialon, Roberto Dessì, Maria Lomeli, David Esiobu, Yizhong Chen, Kai Arulkumaran, Antoine Cully, Cyprien de Masson d’Autume, Hadi Eshragh, Nacér Hassen, Zachary Kenton, Andrew Li, Aravind Mahendran, Daniel Mankowitz, Piotr Mirowski, Anna Rogers, Hubert Soyer, Nino Vieillard, Martha White, Yuting Yang, et al. GAIA: A benchmark for general AI assis...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis.arXiv preprint arXiv:2307.16789, 2023. 10

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

On the tool manipulation capability of open-source large language models.arXiv preprint arXiv:2305.16504, 2023

Qiantong Xu, Fenglu Hong, Bo Li, Changran Hu, Zhengyu Chen, and Jian Zhang. On the tool manipulation capability of open-source large language models.arXiv preprint arXiv:2305.16504, 2023

work page arXiv 2023
[6]

Stabletoolbench: Towards stable large-scale benchmarking on tool learning of large language models

Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li, Zhiyuan Liu, Maosong Sun, and Yang Liu. Stabletoolbench: Towards stable large-scale benchmarking on tool learning of large language models. InFindings of the Association for Computational Linguistics: ACL 2024, pages 11143–11156, Bangkok, Thailand, 2024. Association for Computational L...

work page 2024
[7]

Gorilla: Large Language Model Connected with Massive APIs

Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive apis.arXiv preprint arXiv:2305.15334, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Webshop: Towards scalable real-world web interaction with grounded language agents.arXiv preprint arXiv:2207.01206, 2022

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents.arXiv preprint arXiv:2207.01206, 2022

work page arXiv 2022
[9]

Shopping mmlu: A massive multi-task online shopping benchmark for large language models.arXiv preprint arXiv:2410.20745, 2024

Yilun Jin, Zheng Li, Chenwei Zhang, Tianyu Cao, Yifan Gao, Pratik Jayarao, Mao Li, Xin Liu, Ritesh Sarkhel, Xianfeng Tang, Haodong Wang, Zhengyang Wang, Wenju Xu, Jingfeng Yang, Qingyu Yin, Xian Li, Priyanka Nigam, Yi Xu, Kai Chen, Qiang Yang, Meng Jiang, and Bing Yin. Shopping mmlu: A massive multi-task online shopping benchmark for large language models...

work page arXiv 2024
[10]

Shoppingbench: A real-world intent-grounded shopping benchmark for llm-based agents

Jiangyuan Wang, Kejun Xiao, Qi Sun, Huaipeng Zhao, Tao Luo, Jian Dong Zhang, and Xiaoyi Zeng. Shoppingbench: A real-world intent-grounded shopping benchmark for llm-based agents. arXiv preprint arXiv:2508.04266, 2025

work page arXiv 2025
[11]

WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents

Ralph Peeters, Aaron Steiner, Luca Schwarz, Julian Yuya Caspary, and Christian Bizer. Webmall – a multi-shop benchmark for evaluating web agents.arXiv preprint arXiv:2508.13024, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Shoppingcomp: Are llms really ready for your shopping cart?arXiv preprint arXiv:2511.22978, 2025

Huaixiao Tou, Ying Zeng, Cong Ma, Muzhi Li, Minghao Li, Weijie Yuan, He Zhang, and Kai Jia. Shoppingcomp: Are llms really ready for your shopping cart?arXiv preprint arXiv:2511.22978, 2025

work page arXiv 2025
[13]

Deepshop: A benchmark for deep research shopping agents.arXiv preprint arXiv:2506.02839, 2025

Yougang Lyu, Xiaoyu Zhang, Lingyong Yan, Maarten de Rijke, Zhaochun Ren, and Xi- uying Chen. Deepshop: A benchmark for deep research shopping agents.arXiv preprint arXiv:2506.02839, 2025

work page arXiv 2025
[14]

Recmind: Large language model powered agent for recommendation.arXiv preprint arXiv:2308.14296, 2023

Yancheng Wang, Ziyan Jiang, Zheng Chen, Fan Yang, Yingxue Zhou, Eunah Cho, Xing Fan, Xiaojiang Huang, Yanbin Lu, and Yingzhen Yang. Recmind: Large language model powered agent for recommendation.arXiv preprint arXiv:2308.14296, 2023

work page arXiv 2023
[15]

Let me do it for you: Towards llm empowered recommendation via tool learning

Yuyue Zhao, Jiancan Wu, Xiang Wang, Wei Tang, Dingxian Wang, and Maarten de Rijke. Let me do it for you: Towards llm empowered recommendation via tool learning. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1796–1806. Association for Computing Machinery, 2024

work page 2024
[16]

Towards agentic recommender systems in the era of multimodal large language models.arXiv preprint arXiv:2503.16734, 2025

Chengkai Huang, Junda Wu, Yu Xia, et al. Towards agentic recommender systems in the era of multimodal large language models.arXiv preprint arXiv:2503.16734, 2025

work page arXiv 2025
[17]

A survey on llm-powered agents for recommender systems.arXiv preprint arXiv:2502.10050, 2025

Qiyao Peng, Hongtao Liu, Hua Huang, Qing Yang, and Minglai Shao. A survey on llm-powered agents for recommender systems.arXiv preprint arXiv:2502.10050, 2025

work page arXiv 2025
[18]

Slateq: A tractable decomposition for reinforcement learning with recommendation sets

Eugene Ie, Vihan Jain, Jing Wang, Sanmit Narvekar, Ritesh Agarwal, Rui Wu, Heng-Tze Cheng, Morgane Lustman, Vince Gatto, Paul Covington, Jim McFadden, Tushar Chandra, and Craig Boutilier. Slateq: A tractable decomposition for reinforcement learning with recommendation sets. InProceedings of the Twenty-Eighth International Joint Conference on Artificial In...

work page 2019
[19]

Slate-aware ranking for recommendation

Yi Ren, Xiao Han, Xu Zhao, Shenzheng Zhang, and Yan Zhang. Slate-aware ranking for recommendation. InProceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, pages 499–507. Association for Computing Machinery, 2023. 11

work page 2023
[20]

Generative slate recommendation with reinforcement learning

Romain Deffayet, Thibault Thonet, Walid Bendada, Guillaume Bisson, Fabrice Popineau, Lynda Tamine, and Jefrey Lijffijt. Generative slate recommendation with reinforcement learning. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, pages 580–588. Association for Computing Machinery, 2023

work page 2023
[21]

Generating and personalizing bundle recommendations on steam

Apurva Pathak, Kshitiz Gupta, and Julian McAuley. Generating and personalizing bundle recommendations on steam. InProceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1073–1076. Association for Computing Machinery, 2017

work page 2017
[22]

Matching user with item set: Collaborative bundle recommendation with deep attention network

Liang Chen, Yang Liu, Xiangnan He, Lianli Gao, and Zibin Zheng. Matching user with item set: Collaborative bundle recommendation with deep attention network. InProceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, pages 2095–2101. International Joint Conferences on Artificial Intelligence Organization, 2019

work page 2095
[23]

Bundle recommendation with graph convolutional networks

Jianxin Chang, Chen Gao, Xiangnan He, Depeng Jin, and Yong Li. Bundle recommendation with graph convolutional networks. InProceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1673–1676. Association for Computing Machinery, 2020

work page 2020
[24]

Diversity in recommender systems–a survey.Knowledge- Based Systems, 123:154–162, 2017

Matevž Kunaver and Tomaž Požrl. Diversity in recommender systems–a survey.Knowledge- Based Systems, 123:154–162, 2017

work page 2017
[25]

Marius Kaminskas and Derek Bridge. Diversity, serendipity, novelty, and coverage: A sur- vey and empirical analysis of beyond-accuracy objectives in recommender systems.ACM Transactions on Interactive Intelligent Systems, 7(1), 2016

work page 2016
[26]

Inferring networks of substitutable and complementary products

Julian McAuley et al. Inferring networks of substitutable and complementary products. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015

work page 2015
[27]

P-companion: A principled framework for diversified complementary product recom- mendation

Junheng Hao, Tong Zhao, Jin Li, Xin Luna Dong, Christos Faloutsos, Yizhou Sun, and Wei Wang. P-companion: A principled framework for diversified complementary product recom- mendation. InProceedings of the 29th ACM International Conference on Information and Knowledge Management, pages 2517–2524, 2020

work page 2020
[28]

Is it really complementary? revisiting behavior-based labels for complementary recommendation

Kai Sugahara, Chihiro Yamasaki, and Kazushi Okamoto. Is it really complementary? revisiting behavior-based labels for complementary recommendation. InProceedings of the 18th ACM Conference on Recommender Systems (RecSys), 2024

work page 2024
[29]

Explainable recommendation: A survey and new perspectives

Yongfeng Zhang and Xu Chen. Explainable recommendation: A survey and new perspectives. Foundations and Trends® in Information Retrieval, 14(1):1–101, 2020

work page 2020
[30]

Judging llm-as-a-judge with mt-bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. InAdvances in Neural Information Processing Systems, volume 36, 2024

work page 2024
[31]

Humans or llms as the judge? a study on judgement biases.arXiv preprint arXiv:2402.10669, 2024

Haoning Wu et al. Humans or llms as the judge? a study on judgement biases.arXiv preprint arXiv:2402.10669, 2024

work page arXiv 2024
[32]

Can LLM be a personalized judge? In Findings of the Association for Computational Linguistics: EMNLP 2024, 2024

Yijiang River Dong, Tiancheng Hu, and Nigel Collier. Can LLM be a personalized judge? In Findings of the Association for Computational Linguistics: EMNLP 2024, 2024

work page 2024
[33]

No free labels: Limitations of llm-as-a-judge without human grounding.arXiv preprint arXiv:2503.05061,

Michael Krumdick, Charles Lovering, Varshini Reddy, Seth Ebner, and Chris Tanner. No free labels: Limitations of LLM-as-a-judge without human grounding.arXiv preprint arXiv:2503.05061, 2025

work page arXiv 2025
[34]

RecoGym: A Reinforcement Learning Environment for the problem of Product Recommendation in Online Advertising

David Rohde, Stephen Bonner, Travis Dunlop, Flavian Vasile, and Alexandros Karatzoglou. Recogym: A reinforcement learning environment for the problem of product recommendation in online advertising.arXiv preprint arXiv:1808.00720, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[35]

Recsim: A configurable simulation platform for recommender systems

Eugene Ie, Chih-Wei Hsu, Martin Mladenov, Vihan Jain, Sanmit Narvekar, Jing Wang, Rui Wu, and Craig Boutilier. Recsim: A configurable simulation platform for recommender systems. arXiv preprint arXiv:1909.04847, 2019. 12

work page arXiv 1909
[36]

Recsim NG: Toward principled uncertainty modeling for recommender ecosystems.arXiv preprint arXiv:2103.08057, 2021

Martin Mladenov, Chih-Wei Hsu, Vihan Jain, Eugene Ie, Christopher Colby, Nicolas Mayoraz, Hubert Pham, Dustin Tran, Ivan Vendrov, and Craig Boutilier. Recsim NG: Toward principled uncertainty modeling for recommender ecosystems.arXiv preprint arXiv:2103.08057, 2021

work page arXiv 2021
[37]

Kuaisim: A comprehensive simulator for recommender systems.Advances in Neural Information Processing Systems, 36:44880–44897, 2023

Kesen Zhao, Shuchang Liu, Qingpeng Cai, Xiangyu Zhao, Ziru Liu, Dong Zheng, Peng Jiang, and Kun Gai. Kuaisim: A comprehensive simulator for recommender systems.Advances in Neural Information Processing Systems, 36:44880–44897, 2023

work page 2023
[38]

Bridging Language and Items for Retrieval and Recommendation: Benchmarking LLMs as Semantic Encoders

Yupeng Hou, Jiacheng Li, Zhankui He, An Yan, Xiusi Chen, and Julian McAuley. Bridging language and items for retrieval and recommendation.arXiv preprint arXiv:2403.03952, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

Introducing GPT-4.1 in the API, April 2025

OpenAI. Introducing GPT-4.1 in the API, April 2025. Includes GPT-4.1 mini

work page 2025
[40]

Grok 4.1 fast and agent tools API, November 2025

xAI. Grok 4.1 fast and agent tools API, November 2025

work page 2025
[41]

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025

work page 2025
[42]

Claude Haiku 4.5 system card, October 2025

Anthropic. Claude Haiku 4.5 system card, October 2025

work page 2025
[43]

Qwen3 technical report, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, et al. Qwen3 technical report, 2025

work page 2025
[44]

Apigen-mt: Agentic pipeline for multi-turn data generation via simulated agent-human interplay.arXiv preprint arXiv:2504.03601, 2025

Akshara Prabhakar, Zuxin Liu, Ming Zhu, Jianguo Zhang, Tulika Awalgaonkar, Shiyu Wang, Zhiwei Liu, Haolin Chen, Thai Hoang, et al. APIGen-MT: Agentic pipeline for multi-turn data generation via simulated agent-human interplay.arXiv preprint arXiv:2504.03601, 2025

work page arXiv 2025
[45]

Llama-Nemotron: Efficient reasoning models, 2025

Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, et al. Llama-Nemotron: Efficient reasoning models, 2025. 13 A Benchmark artifact build RecoAtlasbuilds one category-specific environment per Amazon product category. Each environment consists of a filtered catalog I, product texts xi for every item i∈ I , subcategory metada...

work page 2025
[46]

for live performance

In the experiments, we use subcategory-adversarial corruption. For search_products and get_complementary_products, let K be the number of clean returned slots and ρ∈[0,1] the faulty-tool rate. The corruption selects round(Kρ) slots uniformly without replacement. Each corrupted slot keeps its displayed score, but its item ID, title, and description are rep...

work page

[1] [1]

Agentrecbench: Benchmarking llm agent-based personalized recommender systems.arXiv preprint arXiv:2505.19623, 2025

Yu Shang, Peijie Liu, Yuwei Yan, Zijing Wu, Leheng Sheng, Yuanqing Yu, Chumeng Jiang, An Zhang, Fengli Xu, Yu Wang, Min Zhang, and Yong Li. Agentrecbench: Benchmarking llm agent-based personalized recommender systems.arXiv preprint arXiv:2505.19623, 2025

work page arXiv 2025

[2] [2]

AgentBench: Evaluating LLMs as Agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating LLMs as agents.arXiv preprint arXiv:2308.03688, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

GAIA: a benchmark for General AI Assistants

Grégoire Mialon, Roberto Dessì, Maria Lomeli, David Esiobu, Yizhong Chen, Kai Arulkumaran, Antoine Cully, Cyprien de Masson d’Autume, Hadi Eshragh, Nacér Hassen, Zachary Kenton, Andrew Li, Aravind Mahendran, Daniel Mankowitz, Piotr Mirowski, Anna Rogers, Hubert Soyer, Nino Vieillard, Martha White, Yuting Yang, et al. GAIA: A benchmark for general AI assis...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis.arXiv preprint arXiv:2307.16789, 2023. 10

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

On the tool manipulation capability of open-source large language models.arXiv preprint arXiv:2305.16504, 2023

Qiantong Xu, Fenglu Hong, Bo Li, Changran Hu, Zhengyu Chen, and Jian Zhang. On the tool manipulation capability of open-source large language models.arXiv preprint arXiv:2305.16504, 2023

work page arXiv 2023

[6] [6]

Stabletoolbench: Towards stable large-scale benchmarking on tool learning of large language models

Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li, Zhiyuan Liu, Maosong Sun, and Yang Liu. Stabletoolbench: Towards stable large-scale benchmarking on tool learning of large language models. InFindings of the Association for Computational Linguistics: ACL 2024, pages 11143–11156, Bangkok, Thailand, 2024. Association for Computational L...

work page 2024

[7] [7]

Gorilla: Large Language Model Connected with Massive APIs

Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive apis.arXiv preprint arXiv:2305.15334, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [8]

Webshop: Towards scalable real-world web interaction with grounded language agents.arXiv preprint arXiv:2207.01206, 2022

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents.arXiv preprint arXiv:2207.01206, 2022

work page arXiv 2022

[9] [9]

Shopping mmlu: A massive multi-task online shopping benchmark for large language models.arXiv preprint arXiv:2410.20745, 2024

Yilun Jin, Zheng Li, Chenwei Zhang, Tianyu Cao, Yifan Gao, Pratik Jayarao, Mao Li, Xin Liu, Ritesh Sarkhel, Xianfeng Tang, Haodong Wang, Zhengyang Wang, Wenju Xu, Jingfeng Yang, Qingyu Yin, Xian Li, Priyanka Nigam, Yi Xu, Kai Chen, Qiang Yang, Meng Jiang, and Bing Yin. Shopping mmlu: A massive multi-task online shopping benchmark for large language models...

work page arXiv 2024

[10] [10]

Shoppingbench: A real-world intent-grounded shopping benchmark for llm-based agents

Jiangyuan Wang, Kejun Xiao, Qi Sun, Huaipeng Zhao, Tao Luo, Jian Dong Zhang, and Xiaoyi Zeng. Shoppingbench: A real-world intent-grounded shopping benchmark for llm-based agents. arXiv preprint arXiv:2508.04266, 2025

work page arXiv 2025

[11] [11]

WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents

Ralph Peeters, Aaron Steiner, Luca Schwarz, Julian Yuya Caspary, and Christian Bizer. Webmall – a multi-shop benchmark for evaluating web agents.arXiv preprint arXiv:2508.13024, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

Shoppingcomp: Are llms really ready for your shopping cart?arXiv preprint arXiv:2511.22978, 2025

Huaixiao Tou, Ying Zeng, Cong Ma, Muzhi Li, Minghao Li, Weijie Yuan, He Zhang, and Kai Jia. Shoppingcomp: Are llms really ready for your shopping cart?arXiv preprint arXiv:2511.22978, 2025

work page arXiv 2025

[13] [13]

Deepshop: A benchmark for deep research shopping agents.arXiv preprint arXiv:2506.02839, 2025

Yougang Lyu, Xiaoyu Zhang, Lingyong Yan, Maarten de Rijke, Zhaochun Ren, and Xi- uying Chen. Deepshop: A benchmark for deep research shopping agents.arXiv preprint arXiv:2506.02839, 2025

work page arXiv 2025

[14] [14]

Recmind: Large language model powered agent for recommendation.arXiv preprint arXiv:2308.14296, 2023

Yancheng Wang, Ziyan Jiang, Zheng Chen, Fan Yang, Yingxue Zhou, Eunah Cho, Xing Fan, Xiaojiang Huang, Yanbin Lu, and Yingzhen Yang. Recmind: Large language model powered agent for recommendation.arXiv preprint arXiv:2308.14296, 2023

work page arXiv 2023

[15] [15]

Let me do it for you: Towards llm empowered recommendation via tool learning

Yuyue Zhao, Jiancan Wu, Xiang Wang, Wei Tang, Dingxian Wang, and Maarten de Rijke. Let me do it for you: Towards llm empowered recommendation via tool learning. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1796–1806. Association for Computing Machinery, 2024

work page 2024

[16] [16]

Towards agentic recommender systems in the era of multimodal large language models.arXiv preprint arXiv:2503.16734, 2025

Chengkai Huang, Junda Wu, Yu Xia, et al. Towards agentic recommender systems in the era of multimodal large language models.arXiv preprint arXiv:2503.16734, 2025

work page arXiv 2025

[17] [17]

A survey on llm-powered agents for recommender systems.arXiv preprint arXiv:2502.10050, 2025

Qiyao Peng, Hongtao Liu, Hua Huang, Qing Yang, and Minglai Shao. A survey on llm-powered agents for recommender systems.arXiv preprint arXiv:2502.10050, 2025

work page arXiv 2025

[18] [18]

Slateq: A tractable decomposition for reinforcement learning with recommendation sets

Eugene Ie, Vihan Jain, Jing Wang, Sanmit Narvekar, Ritesh Agarwal, Rui Wu, Heng-Tze Cheng, Morgane Lustman, Vince Gatto, Paul Covington, Jim McFadden, Tushar Chandra, and Craig Boutilier. Slateq: A tractable decomposition for reinforcement learning with recommendation sets. InProceedings of the Twenty-Eighth International Joint Conference on Artificial In...

work page 2019

[19] [19]

Slate-aware ranking for recommendation

Yi Ren, Xiao Han, Xu Zhao, Shenzheng Zhang, and Yan Zhang. Slate-aware ranking for recommendation. InProceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, pages 499–507. Association for Computing Machinery, 2023. 11

work page 2023

[20] [20]

Generative slate recommendation with reinforcement learning

Romain Deffayet, Thibault Thonet, Walid Bendada, Guillaume Bisson, Fabrice Popineau, Lynda Tamine, and Jefrey Lijffijt. Generative slate recommendation with reinforcement learning. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, pages 580–588. Association for Computing Machinery, 2023

work page 2023

[21] [21]

Generating and personalizing bundle recommendations on steam

Apurva Pathak, Kshitiz Gupta, and Julian McAuley. Generating and personalizing bundle recommendations on steam. InProceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1073–1076. Association for Computing Machinery, 2017

work page 2017

[22] [22]

Matching user with item set: Collaborative bundle recommendation with deep attention network

Liang Chen, Yang Liu, Xiangnan He, Lianli Gao, and Zibin Zheng. Matching user with item set: Collaborative bundle recommendation with deep attention network. InProceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, pages 2095–2101. International Joint Conferences on Artificial Intelligence Organization, 2019

work page 2095

[23] [23]

Bundle recommendation with graph convolutional networks

Jianxin Chang, Chen Gao, Xiangnan He, Depeng Jin, and Yong Li. Bundle recommendation with graph convolutional networks. InProceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1673–1676. Association for Computing Machinery, 2020

work page 2020

[24] [24]

Diversity in recommender systems–a survey.Knowledge- Based Systems, 123:154–162, 2017

Matevž Kunaver and Tomaž Požrl. Diversity in recommender systems–a survey.Knowledge- Based Systems, 123:154–162, 2017

work page 2017

[25] [25]

Marius Kaminskas and Derek Bridge. Diversity, serendipity, novelty, and coverage: A sur- vey and empirical analysis of beyond-accuracy objectives in recommender systems.ACM Transactions on Interactive Intelligent Systems, 7(1), 2016

work page 2016

[26] [26]

Inferring networks of substitutable and complementary products

Julian McAuley et al. Inferring networks of substitutable and complementary products. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015

work page 2015

[27] [27]

P-companion: A principled framework for diversified complementary product recom- mendation

Junheng Hao, Tong Zhao, Jin Li, Xin Luna Dong, Christos Faloutsos, Yizhou Sun, and Wei Wang. P-companion: A principled framework for diversified complementary product recom- mendation. InProceedings of the 29th ACM International Conference on Information and Knowledge Management, pages 2517–2524, 2020

work page 2020

[28] [28]

Is it really complementary? revisiting behavior-based labels for complementary recommendation

Kai Sugahara, Chihiro Yamasaki, and Kazushi Okamoto. Is it really complementary? revisiting behavior-based labels for complementary recommendation. InProceedings of the 18th ACM Conference on Recommender Systems (RecSys), 2024

work page 2024

[29] [29]

Explainable recommendation: A survey and new perspectives

Yongfeng Zhang and Xu Chen. Explainable recommendation: A survey and new perspectives. Foundations and Trends® in Information Retrieval, 14(1):1–101, 2020

work page 2020

[30] [30]

Judging llm-as-a-judge with mt-bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. InAdvances in Neural Information Processing Systems, volume 36, 2024

work page 2024

[31] [31]

Humans or llms as the judge? a study on judgement biases.arXiv preprint arXiv:2402.10669, 2024

Haoning Wu et al. Humans or llms as the judge? a study on judgement biases.arXiv preprint arXiv:2402.10669, 2024

work page arXiv 2024

[32] [32]

Can LLM be a personalized judge? In Findings of the Association for Computational Linguistics: EMNLP 2024, 2024

Yijiang River Dong, Tiancheng Hu, and Nigel Collier. Can LLM be a personalized judge? In Findings of the Association for Computational Linguistics: EMNLP 2024, 2024

work page 2024

[33] [33]

No free labels: Limitations of llm-as-a-judge without human grounding.arXiv preprint arXiv:2503.05061,

Michael Krumdick, Charles Lovering, Varshini Reddy, Seth Ebner, and Chris Tanner. No free labels: Limitations of LLM-as-a-judge without human grounding.arXiv preprint arXiv:2503.05061, 2025

work page arXiv 2025

[34] [34]

RecoGym: A Reinforcement Learning Environment for the problem of Product Recommendation in Online Advertising

David Rohde, Stephen Bonner, Travis Dunlop, Flavian Vasile, and Alexandros Karatzoglou. Recogym: A reinforcement learning environment for the problem of product recommendation in online advertising.arXiv preprint arXiv:1808.00720, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[35] [35]

Recsim: A configurable simulation platform for recommender systems

Eugene Ie, Chih-Wei Hsu, Martin Mladenov, Vihan Jain, Sanmit Narvekar, Jing Wang, Rui Wu, and Craig Boutilier. Recsim: A configurable simulation platform for recommender systems. arXiv preprint arXiv:1909.04847, 2019. 12

work page arXiv 1909

[36] [36]

Recsim NG: Toward principled uncertainty modeling for recommender ecosystems.arXiv preprint arXiv:2103.08057, 2021

Martin Mladenov, Chih-Wei Hsu, Vihan Jain, Eugene Ie, Christopher Colby, Nicolas Mayoraz, Hubert Pham, Dustin Tran, Ivan Vendrov, and Craig Boutilier. Recsim NG: Toward principled uncertainty modeling for recommender ecosystems.arXiv preprint arXiv:2103.08057, 2021

work page arXiv 2021

[37] [37]

Kuaisim: A comprehensive simulator for recommender systems.Advances in Neural Information Processing Systems, 36:44880–44897, 2023

Kesen Zhao, Shuchang Liu, Qingpeng Cai, Xiangyu Zhao, Ziru Liu, Dong Zheng, Peng Jiang, and Kun Gai. Kuaisim: A comprehensive simulator for recommender systems.Advances in Neural Information Processing Systems, 36:44880–44897, 2023

work page 2023

[38] [38]

Bridging Language and Items for Retrieval and Recommendation: Benchmarking LLMs as Semantic Encoders

Yupeng Hou, Jiacheng Li, Zhankui He, An Yan, Xiusi Chen, and Julian McAuley. Bridging language and items for retrieval and recommendation.arXiv preprint arXiv:2403.03952, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[39] [39]

Introducing GPT-4.1 in the API, April 2025

OpenAI. Introducing GPT-4.1 in the API, April 2025. Includes GPT-4.1 mini

work page 2025

[40] [40]

Grok 4.1 fast and agent tools API, November 2025

xAI. Grok 4.1 fast and agent tools API, November 2025

work page 2025

[41] [41]

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025

work page 2025

[42] [42]

Claude Haiku 4.5 system card, October 2025

Anthropic. Claude Haiku 4.5 system card, October 2025

work page 2025

[43] [43]

Qwen3 technical report, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, et al. Qwen3 technical report, 2025

work page 2025

[44] [44]

Apigen-mt: Agentic pipeline for multi-turn data generation via simulated agent-human interplay.arXiv preprint arXiv:2504.03601, 2025

Akshara Prabhakar, Zuxin Liu, Ming Zhu, Jianguo Zhang, Tulika Awalgaonkar, Shiyu Wang, Zhiwei Liu, Haolin Chen, Thai Hoang, et al. APIGen-MT: Agentic pipeline for multi-turn data generation via simulated agent-human interplay.arXiv preprint arXiv:2504.03601, 2025

work page arXiv 2025

[45] [45]

Llama-Nemotron: Efficient reasoning models, 2025

Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, et al. Llama-Nemotron: Efficient reasoning models, 2025. 13 A Benchmark artifact build RecoAtlasbuilds one category-specific environment per Amazon product category. Each environment consists of a filtered catalog I, product texts xi for every item i∈ I , subcategory metada...

work page 2025

[46] [46]

for live performance

In the experiments, we use subcategory-adversarial corruption. For search_products and get_complementary_products, let K be the number of clean returned slots and ρ∈[0,1] the faulty-tool rate. The corruption selects round(Kρ) slots uniformly without replacement. Each corrupted slot keeps its displayed score, but its item ID, title, and description are rep...

work page