VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

Yuxin Chen, Yi Zhang, Zhengzhou Cai, Yaorui Shi, Zhiyuan Yao, Chenhang Cui, Jingnan Zheng, Yaqi Huo, Xi Su, Qi Gu, Xunliang Cai, Xiang Wang, An Zhang, Tat-Seng Chua

arXiv: 2605.27141 · v1 · pith:V4WBGVA5new · submitted 2026-05-26 · cs.AI

Pith reviewed 2026-06-29 17:07 UTC · model grok-4.3

classification 💻 cs.AI

keywords personalized agents · proactive agents · long-term interactions · LLM benchmarks · user preference inference · memory architectures · agent evaluation

The pith

VitaBench 2.0 shows that state-of-the-art LLMs still struggle to personalize and act proactively in long-term user interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VitaBench 2.0 to test how well LLM agents infer and apply user preferences from fragmented, temporally ordered interactions across individual users. Tasks require continuous extraction, updating, and use of preferences plus proactive steps to gather missing information before decisions. Benchmark results on frontier models indicate these abilities remain highly challenging and fall short of practical needs. An extensible memory interface is provided to enable systematic comparisons of architectures, and the work catalogs observed failure modes in preference handling.

Core claim

VitaBench 2.0 organizes evaluation tasks as temporally ordered sequences for individual users in which preferences are embedded in fragmented and heterogeneous interactions; successful performance demands that agents continuously extract, utilize, and update those preferences while also recognizing and acquiring missing information from users or environments, and frontier models exhibit a substantial gap from the required capabilities.
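The two demands in the core claim, updating preferences as interactions arrive and asking for missing information before deciding, can be illustrated with a minimal sketch. This is not the benchmark's implementation: `PreferenceStore`, `missing_slots`, `decide`, and the slot names are all hypothetical, and the latest-value-wins update rule is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PreferenceStore:
    """Hypothetical latest-wins store for preferences seen over time."""
    prefs: dict[str, str] = field(default_factory=dict)

    def update(self, observed: dict[str, str]) -> None:
        # Assumption: a later interaction overrides an earlier value for the same key.
        self.prefs.update(observed)


def missing_slots(required: list[str], store: PreferenceStore) -> list[str]:
    """Return the required slots that cannot be filled from stored preferences."""
    return [slot for slot in required if slot not in store.prefs]


def decide(required: list[str], store: PreferenceStore):
    """Ask for missing information first; act only when every slot is known."""
    gaps = missing_slots(required, store)
    if gaps:
        return ("ask", gaps)
    return ("act", {slot: store.prefs[slot] for slot in required})


store = PreferenceStore()
store.update({"cuisine": "sichuan"})                       # earlier interaction
store.update({"cuisine": "cantonese", "budget": "low"})    # later interaction revises it
print(decide(["cuisine", "budget", "party_size"], store))  # party_size is still unknown
```

A real agent would have to infer these key-value pairs from free-form interactions, which is where the paper locates the difficulty; the sketch only shows the control flow.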

What carries the argument

The VitaBench 2.0 benchmark of user-specific temporal task sequences with embedded preferences, combined with proactiveness tasks that test acquisition of missing information and an extensible memory interface for architecture comparisons.
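To make the idea of an extensible memory interface concrete, here is a minimal sketch of how swappable memory architectures could share one contract. The `Memory` protocol, both implementations, and `run_sequence` are assumptions for illustration; the paper's actual interface is not shown on this page and may differ.

```python
from collections import deque
from typing import Protocol


class Memory(Protocol):
    """Hypothetical contract that every memory architecture would satisfy."""
    def write(self, interaction: str) -> None: ...
    def read(self, query: str) -> list[str]: ...


class FullHistoryMemory:
    """Keeps every interaction and returns all of them on read."""
    def __init__(self) -> None:
        self._log: list[str] = []

    def write(self, interaction: str) -> None:
        self._log.append(interaction)

    def read(self, query: str) -> list[str]:
        return list(self._log)  # the query is ignored in this baseline


class SlidingWindowMemory:
    """Keeps only the k most recent interactions."""
    def __init__(self, k: int) -> None:
        self._log: deque[str] = deque(maxlen=k)

    def write(self, interaction: str) -> None:
        self._log.append(interaction)  # the oldest item is dropped when full

    def read(self, query: str) -> list[str]:
        return list(self._log)


def run_sequence(memory: Memory, interactions: list[str], query: str) -> list[str]:
    """Feed temporally ordered interactions, then read context for a task."""
    for item in interactions:
        memory.write(item)
    return memory.read(query)


history = ["likes spicy food", "moved to a new city", "now avoids spicy food"]
print(run_sequence(FullHistoryMemory(), history, "dinner"))
print(run_sequence(SlidingWindowMemory(k=2), history, "dinner"))
```

Because both classes satisfy the same protocol, the evaluation loop stays fixed while the architecture varies, which is the kind of controlled comparison the summary describes.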

Load-bearing premise

The constructed tasks, and the way preferences are embedded in them, accurately capture the inference and proactivity demands of genuine long-term user interactions.

What would settle it

A direct comparison in which agents that perform well on VitaBench 2.0 are deployed in real extended user sessions and their personalization accuracy or user satisfaction is measured against benchmark scores.
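One way such a comparison could be quantified is a rank correlation between benchmark scores and a deployment-side metric. The sketch below computes Spearman's rho with the standard library only; the choice of Spearman correlation is an assumption, not something the paper proposes, and the numbers in the usage lines are made up to show the call.

```python
from statistics import mean


def ranks(values: list[float]) -> list[float]:
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: list[float], y: list[float]) -> float:
    """Pearson correlation of the rank vectors (undefined if either is constant)."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Placeholder values only: one benchmark score and one satisfaction score per agent.
benchmark_scores = [0.2, 0.4, 0.6]
user_satisfaction = [3.1, 3.9, 4.5]
print(spearman(benchmark_scores, user_satisfaction))
```

A rho near 1 would suggest the benchmark ranks agents the way real sessions do; a low or negative value would undercut the load-bearing premise.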

Figures

Figures reproduced from arXiv: 2605.27141 by An Zhang, Chenhang Cui, Jingnan Zheng, Qi Gu, Tat-Seng Chua, Xiang Wang, Xi Su, Xunliang Cai, Yaorui Shi, Yaqi Huo, Yi Zhang, Yuxin Chen, Zhengzhou Cai, Zhiyuan Yao.

**Figure 3.** Average performance across tasks at each temporal task index.

Accompanying text from Section 4.1 (Experimental Setups), Models: We evaluate a diverse set of state-of-the-art proprietary and open LLMs, covering both non-thinking and thinking configurations when available. The evaluated models include the OpenAI family (GPT-3.5-Turbo, GPT-4o-mini, GPT-5, and o-series models such as o3 and o4-mini [58, 3, 59–61]); the DeepS…

**Figure 4.** Analysis of model behavior on VitaBench 2.0.

**Figure 5.** Failure pattern statistics for DeepSeek-V4-Pro and DeepSeek-R1. Category A denotes …

**Figure 6.** Overview of user profile statistics in VitaBench 2.0.

**Figure 7.** Overview of user preference statistics in VitaBench 2.0.

Original abstract

Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated, as user intent is often reflected in fragmented daily interactions and requires both personalized modeling and proactive interaction. However, existing agent benchmarks primarily evaluate reasoning and tool use, largely overlooking the challenges of inferring and leveraging user preferences in realistic scenarios. To address this gap, we introduce VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions. In VitaBench 2.0, tasks are organized as temporally ordered sequences for individual users, where preferences are embedded in fragmented and heterogeneous interactions. Successful completion of tasks requires the agent to continuously extract, utilize, and update user preferences from these interactions. We further evaluate proactiveness through tasks that require agents to recognize missing information and actively acquire it from users or environments before making decisions. To support systematic analysis, we provide an extensible memory interface that enables controlled comparison across different memory architectures. We benchmark a diverse set of frontier proprietary and open-source LLMs. Results show that real-world personalization remains highly challenging even for state-of-the-art models, revealing a substantial gap between current capabilities and practical requirements. Extensive analysis further reveals the failure modes and capability bottlenecks of current agents in real-world personalized decision-making, providing insights for future model improvements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

VitaBench 2.0 supplies a benchmark for long-term preference inference and proactivity that existing agent suites largely skip, and the reported model gaps look plausible on the surface.

VitaBench 2.0 sets up tasks as temporally ordered user sequences where preferences appear in fragmented interactions, then adds separate proactivity tests that force agents to notice missing information and seek it out. It also ships an extensible memory interface so different architectures can be swapped in for controlled comparisons. That combination is the actual addition; most prior benchmarks stop at single-turn reasoning or tool calls.

The paper does the straightforward thing of running frontier models on these tasks and documenting that personalization stays hard. The failure-mode analysis is the part that could be useful to people actually building agents, because it points to specific bottlenecks rather than just reporting aggregate scores.

The soft spot is the task construction itself. The abstract describes embedding preferences in heterogeneous interactions, but without the full details on how those embeddings are generated, validated, or scored, it is difficult to judge whether the difficulty is coming from the intended capability or from quirks in how the sequences were built. The stress-test note says the argument stays internally consistent once the methods are laid out, and that seems right, but the strength of the claim still hinges on those sections.

This is aimed at researchers working on memory, long-horizon agents, or personalization layers. Anyone already running their own long-term interaction experiments would get concrete value from the interface and the baseline numbers. It is not a foundational theoretical paper, but the empirical gap it documents is worth having a public yardstick for.

I would send it to peer review. The contribution is scoped and the evaluation is on real models, so referees can check the construction details and decide how much weight to give the results.

Referee Report

0 major / 0 minor

Summary. The paper introduces VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions. Tasks are organized as temporally ordered sequences for individual users with preferences embedded in fragmented and heterogeneous interactions; successful completion requires agents to continuously extract, utilize, and update user preferences. Proactiveness is evaluated via tasks that require recognizing missing information and actively acquiring it from users or environments. An extensible memory interface supports controlled comparisons across memory architectures. Benchmarking of frontier proprietary and open-source LLMs shows that real-world personalization remains highly challenging, revealing a substantial gap between current capabilities and practical requirements, along with analysis of failure modes.

Significance. If the benchmark construction and evaluation protocol hold, the work fills a clear gap in existing agent benchmarks by targeting long-term personalization and proactivity rather than isolated reasoning or tool use. The extensible memory interface is a concrete strength that enables systematic ablation across architectures. The paper supplies the task-generation procedure, memory interface, and evaluation protocol, supporting reproducibility; the scoped claim of a capability gap on this benchmark is internally consistent and provides actionable insights into bottlenecks for future model development.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of VitaBench 2.0, their recognition of the benchmark's contributions to long-term personalization and proactivity, and their recommendation to accept the manuscript.

Circularity Check

0 steps flagged

No significant circularity

The paper introduces VitaBench 2.0 as an empirical benchmark for agent personalization and proactivity, with tasks organized as temporally ordered sequences embedding fragmented preferences, plus an extensible memory interface and evaluation protocol. The central claim—that state-of-the-art models exhibit a substantial gap on these tasks—is supported directly by reported benchmark results rather than any derivation, equation, fitted parameter, or self-citation chain that reduces to prior inputs. No load-bearing step equates a prediction to its own construction; the argument remains self-contained within the supplied task-generation and evaluation procedures.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work introduces a benchmark rather than a theoretical derivation; it relies on standard assumptions about user behavior and interaction modeling without introducing new free parameters, axioms beyond domain conventions, or invented physical entities.

Reference graph

Works this paper leans on

126 extracted references · 49 canonical work pages · 23 internal anchors

[1] DeepSeekAI. Deepseek-v3.1 model card. 2025. URL https://huggingface.co/deepseek-ai/DeepSeek-V3.1
[2] Daya Guo et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025
[3] OpenAI. Introducing gpt-5. 2025. URL https://openai.com/index/introducing-gpt-5/
[4] Anthropic. Claude sonnet 4.5 model card. 2025. URL https://www.anthropic.com/news/claude-sonnet-4-5
[5] Meituan LongCat Team. Longcat-flash-thinking-2601 technical report. CoRR, abs/2601.16725, 2026
[6] Aixin Liu et al. Deepseek-v3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025
[7] Qwen Team. Qwen3-max model card. 2025. URL https://qwen.ai/blog?id=qwen3-max
[8] Jiahong Liu, Zexuan Qiu, Zhongyang Li, Quanyu Dai, Wenhao Yu, Jieming Zhu, Minda Hu, Menglin Yang, Tat-Seng Chua, and Irwin King. A survey of personalized large language models: Progress and future directions. arXiv preprint arXiv:2502.11528, 2025
[9] Bowen Jiang, Yuan Yuan, Maohao Shen, Zhuoqun Hao, Zhangchen Xu, Zichen Chen, et al. PersonaMem-v2: Towards personalized intelligence via learning implicit user personas and agentic memory. arXiv preprint arXiv:2512.06688, 2025
[10] Bowen Jiang, Zhuoqun Hao, Young-Min Cho, Bryan Li, Yuan Yuan, Sihao Chen, Lyle Ungar, Camillo J Taylor, and Dan Roth. Know me, respond to me: Benchmarking LLMs for dynamic user profiling and personalized responses at scale. arXiv preprint arXiv:2504.14225, 2025
[11] Tongbo Chen, Zhengxi Lu, Zhan Xu, Guocheng Shao, Shaohan Zhao, Fei Tang, Yong Du, Kaitao Song, Yizhou Liu, Yuchen Yan, et al. Knowu-bench: Towards interactive, proactive, and personalized mobile agent evaluation. arXiv preprint arXiv:2604.08455, 2026
[12] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations, 2024
[13] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, et al. AgentBench: Evaluating LLMs as agents. In International Conference on Learning Representations, 2024
[14] Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, et al. WebArena: A realistic web environment for building autonomous agents. In International Conference on Learning Representations, 2024
[15] Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024
[16] Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ²-bench: Evaluating conversational agents in a dual-control environment. arXiv preprint arXiv:2506.07982, 2025
[17] Wei He, Yueqing Sun, Hongyan Hao, Xueyuan Hao, Zhikang Xia, Qi Gu, Chengcheng Han, et al. VitaBench: Benchmarking LLM agents with versatile interactive tasks in real-world applications. arXiv preprint arXiv:2509.26490, 2025
[18] Zhehao Zhang, Ryan Lutz, Aidan Mao, Tianyue Bao, Zijian Wang, Zhoujian Zhao, Kaixin Xiang, Liwei Ding, Le Tong, Jiaxin Zhuo, et al. Personalization of large language models: A survey. arXiv preprint arXiv:2411.00027, 2024
[19] Mem0. Mem0: The memory layer for personalized AI. https://mem0.ai, 2024
[22] Yu-Min Tseng, Yu-Chao Huang, Teng-Yun Hsiao, Wei-Lin Huang, et al. Two tales of persona in LLMs: A survey of role-playing and personalization. arXiv preprint arXiv:2406.01171, 2024
[23] Alireza Salemi, Sheshera Mysore, Michael Bendersky, and Hamed Zamani. Optimization methods for personalizing large language models through retrieval augmentation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024
[24] Sheshera Mysore, Zhuoran Lu, Mengting Wan, Julian McAuley, and Hamed Zamani. PEARL: Personalizing large language model writing assistants with generation-calibrated retrievers. In Proceedings of the 1st Workshop on Customizable NLP, 2024
[25] Jesse Richardson, Kristen Bloom, Aggeliki Founta, and Brendan Mathew. Integrating summarization and retrieval for enhanced personalization via large language models. arXiv preprint arXiv:2310.20081, 2023
[26] Cheng Li, Mingyang Chen, Haoping Wang, Bin Zhu, Haoyu Luo, et al. Teach LLMs to personalize–an approach inspired by writing education. arXiv preprint arXiv:2308.07968, 2023
[27] Ostap Wu, Max Haim, Tanmay Dey, et al. Understanding the role of user profile in the personalization of large language models. arXiv preprint arXiv:2406.17803, 2024
[28] Zhaoxuan Tan, Qingkai Zeng, Yijun Tian, Zheyuan Liu, Bing Yin, and Meng Jiang. Democratizing large language models via personalized parameter-efficient fine-tuning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024
[29] Yuting Zhang, Yuliang Ding, et al. PLoRA: Personalized low-rank adaptation for human-centered text understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024
[30] Tao Zhuang, Xin Wang, Zhirui Yuan, et al. HYDRA: Model factorization framework for black-box LLM personalization. arXiv preprint arXiv:2406.02888, 2024
[31] Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Shafran, Yejin Choi, et al. Personalized soups: Personalized large language model alignment via post-hoc parameter merging. arXiv preprint arXiv:2310.11564, 2023
[32] Zhanhui Zhou, Jie Liu, Jing Dong, Jiaheng Yang, et al. Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization. arXiv preprint arXiv:2310.03708, 2023
[33] Xiaoyan Zhao, Juntao You, Yang Zhang, Wenjie Wang, Hong Cheng, Fuli Feng, See-Kiong Ng, and Tat-Seng Chua. NextQuill: Causal preference modeling for enhancing LLM personalization. arXiv preprint arXiv:2506.02368, 2025
[34] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G Patil, Ion Stoica, and Joseph E Gonzalez. MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560, 2023
[35] Wujiang Xu, Zujie Liang, Kai Mei, et al. A-MEM: Agentic memory for LLM agents. arXiv preprint arXiv:2502.12110, 2025
[36] Thomas P Zollo, Andrew Weidinger, et al. PersonalLLM: Tailoring LLMs to individual preferences. arXiv preprint arXiv:2409.20296, 2024
[37] Xiaoyan Zhao, Yang Zhang, Juntao You, Wenjie Wang, Fuli Feng, et al. Do LLMs recognize your preferences? evaluating personalized preference following in LLMs. In International Conference on Learning Representations, 2025
[38] Zhaoxuan Tan et al. PersonaBench: Evaluating AI models on understanding personal information through accessing (synthetic) private user data. In International Conference on Learning Representations, 2025
[39] Alireza Salemi, Sheshera Mysore, Michael Bendersky, and Hamed Zamani. LaMP: When large language models meet personalization. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024
[40] Ishita Kumar, Snigdha Viswanathan, et al. LongLaMP: A benchmark for personalized long-form text generation. arXiv preprint arXiv:2407.11016, 2024
[41] Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024
[42] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Wu, Kai Yu, et al. LongMemEval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813, 2024
[43] Zeyu Zhang et al. MemSim: A Bayesian simulator for evaluating memory of personal assistants. arXiv preprint arXiv:2409.20163, 2024
[44] Jianfei Xiao, Xiang Yu, Chengbing Wang, Wuqiang Zheng, Xinyu Lin, Kaining Liu, Hongxun Ding, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. AlpsBench: An LLM personalization benchmark for real-dialogue memorization and preference alignment. arXiv preprint arXiv:2603.26680, 2026
[45] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems, 2023
[46] Minghao Li, Feifan Song, Bowen Yu, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. API-Bank: A comprehensive benchmark for tool-augmented LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023
[47] Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive APIs. In Advances in Neural Information Processing Systems, 2024
[48] Nicholas Farn and Richard Shin. ToolTalk: Evaluating tool-usage in a conversational setting. arXiv preprint arXiv:2311.10775, 2023
[49] Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji. MINT: Evaluating LLMs in multi-turn interaction with tools and language feedback. In International Conference on Learning Representations, 2024
[50] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, et al. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv preprint arXiv:2307.16789, 2023
[51] Jiarui Lu, Thomas Zhu, Hao Jiang, Marta Skreta, Arun Sai Rawat, et al. ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities. arXiv preprint arXiv:2408.04682, 2024
[52] Wentao Shi, Yu Wang, Yuyang Zhao, Yuxin Chen, Fuli Feng, Xueyuan Hao, Xi Su, Qi Gu, Hui Su, Xunliang Cai, et al. Aj-bench: Benchmarking agent-as-a-judge for environment-aware evaluation. arXiv preprint arXiv:2604.18240, 2026
[53] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, et al. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In Advances in Neural Information Processing Systems, 2024
[54] Harsh Trivedi, Tushar Khot, Mareike Hartmann, Reshef Manber, Vinty Baber, David Fishi, et al. AppWorld: A controllable world of apps and people for benchmarking interactive coding agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024
[55] Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, et al. SkillsBench: Benchmarking how well agent skills work across diverse tasks. arXiv preprint arXiv:2602.12670, 2026
[56] Ruipeng Wang, Yuxin Chen, Yukai Wang, Chang Wu, Junfeng Fang, Xiaodong Cai, Qi Gu, Hui Su, An Zhang, Xiang Wang, et al. Agentnoisebench: Benchmarking robustness of tool-using llm agents under noisy condition. arXiv preprint arXiv:2602.11348, 2026
[57] Jingnan Zheng, Yanzhen Luo, Jingjun Xu, Bingnan Liu, Yuxin Chen, Chenhang Cui, Gelei Deng, Chaochao Lu, Xiang Wang, An Zhang, et al. Risky-bench: Probing agentic safety risks under real-world deployment. arXiv preprint arXiv:2602.03100, 2026
[58] OpenAI. Introducing gpt-4.1 in the api. 2025. URL https://openai.com/index/gpt-4-1/
[59] OpenAI. Introducing gpt-5.1. 2025. URL https://openai.com/index/gpt-5-1/
[60] OpenAI. Introducing gpt-5.2. 2025. URL https://openai.com/index/introducing-gpt-5-2/
[61] OpenAI. Introducing o3 and o4-mini. 2025. URL https://openai.com/index/introducing-o3-and-o4-mini/
[62] DeepSeekAI. Deepseek-v4 model card. 2026. URL huggingface.co/deepseek-ai/DeepSeek-V4-Pro
[63] Anthropic. Claude sonnet 4 system card. 2025. URL https://www.anthropic.com/news/claude-4
[64] Anthropic. Claude opus 4.6 system card. 2026. URL https://www.anthropic.com/claude-opus-4-6-system-card
[65] Gheorghe Comanici et al. Gemini 2.5: Advanced reasoning, multimodality, and agentic capabilities. arXiv preprint arXiv:2507.06261, 2025
[66] Google. Gemini 2.5 pro model card. 2025. URL https://modelcards.withgoogle.com/assets/documents/gemini-2.5-pro.pdf
[67] Google. Gemini 2.5 flash model card. 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Flash-Model-Card.pdf
[68] Aohan Zeng et al. Glm-4.5: Agentic, reasoning, and coding foundation models. arXiv preprint arXiv:2508.06471, 2025
[69] Z.ai. Glm-4.6 technical blog. 2025. URL https://z.ai/blog/glm-4.6
[70] Z.ai. GLM-5.1 model card. 2026. URL https://huggingface.co/zai-org/GLM-5.1
[71] ByteDance. Seed 1.6 technical introduction. 2025. URL https://seed.bytedance.com/en/seed1_6
[72]

Seed 2.0 model card: Towards intelligence frontier for real-world complexity

ByteDance Seed. Seed 2.0 model card: Towards intelligence frontier for real-world complexity
[73]

URLseed.bytedance.com/en/seed2
[74]

Kimi-K2.6 model card

Moonshot AI. Kimi-K2.6 model card. 2026. URL https://huggingface.co/ moonshotai/Kimi-K2.6

2026
[75]

Longcat-flash technical report.arXiv preprint arXiv:2509.01322, 2025

Meituan LongCat Team. Longcat-flash technical report.arXiv preprint arXiv:2509.01322, 2025

work page arXiv 2025
[76]

MiniMax-M2.7: Model self-improvement, driving productivity innovation through technological breakthroughs

MiniMax. MiniMax-M2.7: Model self-improvement, driving productivity innovation through technological breakthroughs. 2026. URLhttps://www.minimax.io/models/text/m27

2026
[77]

Efficient streaming language models with attention sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. InInternational Conference on Learning Representations (ICLR), 2024

2024
[78]

Augmenting language models with long-term memory

Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei. Augmenting language models with long-term memory. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023
[79]

Recursively summarizing enables long-term dialogue memory in large language models

Qingyue Wang, Liang Ding, Yanan Cao, Zhiliang Tian, Shi Wang, Dacheng Tao, and Li Guo. Recursively summarizing enables long-term dialogue memory in large language models. arXiv preprint arXiv:2308.15022, 2023

2023
[80]

Yusen Zhang, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, and Sercan O. Arik. Chain of agents: Large language models collaborating on long-context tasks. In Advances in Neural Information Processing Systems (NeurIPS), 2024

2024
[81]

ReSum: Unlocking long-horizon search intelligence via context summarization

Xixi Wu, Kuan Li, Yida Zhao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou, et al. ReSum: Unlocking long-horizon search intelligence via context summarization. arXiv preprint arXiv:2509.13313, 2025

2025
[82]

Scaling long-horizon LLM agent via context-folding

Weiwei Sun, Miao Lu, Zhan Ling, et al. Scaling long-horizon LLM agent via context-folding. arXiv preprint arXiv:2510.11967, 2025

2025

Showing first 80 references.
