
arxiv: 2605.27141 · v1 · pith:V4WBGVA5new · submitted 2026-05-26 · 💻 cs.AI

VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

Pith reviewed 2026-06-29 17:07 UTC · model grok-4.3

classification 💻 cs.AI
keywords personalized agents · proactive agents · long-term interactions · LLM benchmarks · user preference inference · memory architectures · agent evaluation

The pith

VitaBench 2.0 shows that state-of-the-art LLMs still struggle to personalize and act proactively in long-term user interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VitaBench 2.0 to test how well LLM agents infer and apply user preferences from fragmented, temporally ordered interactions across individual users. Tasks require continuous extraction, updating, and use of preferences plus proactive steps to gather missing information before decisions. Benchmark results on frontier models indicate these abilities remain highly challenging and fall short of practical needs. An extensible memory interface is provided to enable systematic comparisons of architectures, and the work catalogs observed failure modes in preference handling.

Core claim

VitaBench 2.0 organizes evaluation tasks as temporally ordered sequences for individual users in which preferences are embedded in fragmented and heterogeneous interactions; successful performance demands that agents continuously extract, utilize, and update those preferences while also recognizing and acquiring missing information from users or environments, and frontier models exhibit a substantial gap from the required capabilities.
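
To make the shape of such a task sequence concrete, here is a minimal sketch, not the benchmark's actual code. The `Task` fields, the `run_sequence` function, and the toy preferences are all hypothetical names chosen for illustration. It assumes preferences can be reduced to key/value pairs, that a later observation overrides an earlier one, and that the agent asks the user only for keys it still lacks.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One step in a user's temporal sequence (all field names are hypothetical)."""
    index: int
    observed: dict   # preferences revealed by this interaction
    required: list   # preference keys the decision depends on
    user_answers: dict = field(default_factory=dict)  # replies available if asked


def run_sequence(tasks):
    """Walk the tasks in temporal order, carrying preferences forward."""
    prefs = {}
    log = []
    for task in sorted(tasks, key=lambda t: t.index):
        # Extract and update: later observations overwrite earlier ones.
        prefs.update(task.observed)
        # Proactive step: ask only for keys that are still unknown.
        missing = [key for key in task.required if key not in prefs]
        for key in missing:
            if key in task.user_answers:
                prefs[key] = task.user_answers[key]
        decision = {key: prefs.get(key) for key in task.required}
        log.append({"index": task.index, "asked": missing, "decision": decision})
    return log


# Toy sequence: the cuisine preference changes at step 2,
# and the budget must be requested at step 1.
tasks = [
    Task(0, {"cuisine": "thai"}, ["cuisine"]),
    Task(1, {}, ["cuisine", "budget"], {"budget": "low"}),
    Task(2, {"cuisine": "vegan"}, ["cuisine", "budget"]),
]
log = run_sequence(tasks)
```

In this toy run the agent asks for the budget once, reuses it at the last step, and applies the updated cuisine. Real interactions are fragmented natural language rather than clean key/value pairs, which is where the paper locates the difficulty.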

What carries the argument

The VitaBench 2.0 benchmark of user-specific temporal task sequences with embedded preferences, combined with proactiveness tasks that test acquisition of missing information and an extensible memory interface for architecture comparisons.
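
The page does not show the memory interface itself, so the following is only a sketch of what an extensible interface of this kind could look like. The class names `Memory`, `FullHistoryMemory`, and `KeywordMemory`, the `write`/`read` methods, and the `replay` helper are assumptions made for illustration, not the paper's API. The point is that two architectures can be swapped behind one contract and fed identical interactions.

```python
from abc import ABC, abstractmethod


class Memory(ABC):
    """Hypothetical contract: store interactions, return those relevant to a query."""

    @abstractmethod
    def write(self, interaction: str) -> None:
        ...

    @abstractmethod
    def read(self, query: str) -> list:
        ...


class FullHistoryMemory(Memory):
    """Baseline that returns every stored interaction, regardless of the query."""

    def __init__(self):
        self._items = []

    def write(self, interaction):
        self._items.append(interaction)

    def read(self, query):
        return list(self._items)


class KeywordMemory(Memory):
    """Returns only interactions that share at least one word with the query."""

    def __init__(self):
        self._items = []

    def write(self, interaction):
        self._items.append(interaction)

    def read(self, query):
        terms = set(query.lower().split())
        return [item for item in self._items if terms & set(item.lower().split())]


def replay(memory, interactions, query):
    """Feed the same interactions to any Memory, then query it."""
    for item in interactions:
        memory.write(item)
    return memory.read(query)


interactions = [
    "prefers window seats",
    "ordered thai food twice",
    "dislikes early flights",
]
full = replay(FullHistoryMemory(), interactions, "book flights")
filtered = replay(KeywordMemory(), interactions, "book flights")
```

Because both backends receive the same inputs through `replay`, any difference in downstream task success can be attributed to the memory design, which is the kind of controlled comparison the paper describes.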

Load-bearing premise

The constructed tasks and preference embeddings accurately capture the inference and proactivity demands of genuine long-term user interactions.

What would settle it

A direct comparison in which agents that perform well on VitaBench 2.0 are deployed in real extended user sessions and their personalization accuracy or user satisfaction is measured against benchmark scores.
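
One way such a comparison might be scored is a rank correlation between benchmark scores and a deployment metric. The sketch below implements Spearman's rank correlation for lists without ties. The function names are hypothetical, and the numbers are placeholders invented for illustration, not results from the paper or from any deployment.

```python
def ranks(values):
    """Return 1-based ranks of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation, valid when neither list contains ties."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rank_x, rank_y = ranks(xs), ranks(ys)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Placeholder values only: one entry per hypothetical agent.
benchmark_scores = [0.20, 0.50, 0.35, 0.60]
user_satisfaction = [3.1, 3.0, 3.4, 4.2]
rho = spearman(benchmark_scores, user_satisfaction)
```

A value of `rho` near 1 would indicate that the benchmark ranks agents the way real sessions do; a value near 0 would put the load-bearing premise in doubt.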

Figures

Figures reproduced from arXiv: 2605.27141 by An Zhang, Chenhang Cui, Jingnan Zheng, Qi Gu, Tat-Seng Chua, Xiang Wang, Xi Su, Xunliang Cai, Yaorui Shi, Yaqi Huo, Yi Zhang, Yuxin Chen, Zhengzhou Cai, Zhiyuan Yao.

Figure 1: Overview of VitaBench 2.0. The agents are required to operate over temporal task sequences …
Figure 3: Average performance across tasks at each temporal task index.
    4 Experiment, 4.1 Experimental Setups. Models. We evaluate a diverse set of state-of-the-art proprietary and open LLMs, covering both non-thinking and thinking configurations when available. The evaluated models include OpenAI family, including GPT-3.5-Turbo, GPT-4o-mini, GPT-5, and o-series models such as o3 and o4-mini [58, 3, 59–61]; the DeepS…
Figure 4: Analysis of model behavior on VitaBench 2.0.
Figure 5: Failure pattern statistics for DeepSeek-V4-Pro and DeepSeek-R1. Category A denotes …
Figure 6: Overview of user profile statistics in VitaBench 2.0.
Figure 7: Overview of user preference statistics in VitaBench 2.0.
Original abstract

Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated, as user intent is often reflected in fragmented daily interactions and requires both personalized modeling and proactive interaction. However, existing agent benchmarks primarily evaluate reasoning and tool use, largely overlooking the challenges of inferring and leveraging user preferences in realistic scenarios. To address this gap, we introduce VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions. In VitaBench 2.0, tasks are organized as temporally ordered sequences for individual users, where preferences are embedded in fragmented and heterogeneous interactions. Successful completion of tasks requires the agent to continuously extract, utilize, and update user preferences from these interactions. We further evaluate proactiveness through tasks that require agents to recognize missing information and actively acquire it from users or environments before making decisions. To support systematic analysis, we provide an extensible memory interface that enables controlled comparison across different memory architectures. We benchmark a diverse set of frontier proprietary and open-source LLMs. Results show that real-world personalization remains highly challenging even for state-of-the-art models, revealing a substantial gap between current capabilities and practical requirements. Extensive analysis further reveals the failure modes and capability bottlenecks of current agents in real-world personalized decision-making, providing insights for future model improvements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 0 minor

Summary. The paper introduces VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions. Tasks are organized as temporally ordered sequences for individual users with preferences embedded in fragmented and heterogeneous interactions; successful completion requires agents to continuously extract, utilize, and update user preferences. Proactiveness is evaluated via tasks that require recognizing missing information and actively acquiring it from users or environments. An extensible memory interface supports controlled comparisons across memory architectures. Benchmarking of frontier proprietary and open-source LLMs shows that real-world personalization remains highly challenging, revealing a substantial gap between current capabilities and practical requirements, along with analysis of failure modes.

Significance. If the benchmark construction and evaluation protocol hold, the work fills a clear gap in existing agent benchmarks by targeting long-term personalization and proactivity rather than isolated reasoning or tool use. The extensible memory interface is a concrete strength that enables systematic ablation across architectures. The paper supplies the task-generation procedure, memory interface, and evaluation protocol, supporting reproducibility; the scoped claim of a capability gap on this benchmark is internally consistent and provides actionable insights into bottlenecks for future model development.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of VitaBench 2.0, their recognition of the benchmark's contributions to long-term personalization and proactivity, and their recommendation to accept the manuscript.

Circularity Check

0 steps flagged

No significant circularity


The paper introduces VitaBench 2.0 as an empirical benchmark for agent personalization and proactivity, with tasks organized as temporally ordered sequences embedding fragmented preferences, plus an extensible memory interface and evaluation protocol. The central claim—that state-of-the-art models exhibit a substantial gap on these tasks—is supported directly by reported benchmark results rather than any derivation, equation, fitted parameter, or self-citation chain that reduces to prior inputs. No load-bearing step equates a prediction to its own construction; the argument remains self-contained within the supplied task-generation and evaluation procedures.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work introduces a benchmark rather than a theoretical derivation; it relies on standard assumptions about user behavior and interaction modeling without introducing new free parameters, axioms beyond domain conventions, or invented physical entities.

pith-pipeline@v0.9.1-grok · 5830 in / 1132 out tokens · 25206 ms · 2026-06-29T17:07:24.381916+00:00 · methodology


Reference graph

Works this paper leans on

126 extracted references · 49 canonical work pages · 23 internal anchors

  1. [1]

    Deepseek-v3.1 model card

    DeepSeekAI. Deepseek-v3.1 model card. 2025. URL https://huggingface.co/deepseek-ai/DeepSeek-V3.1

  2. [2]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  3. [3]

    Introducing gpt-5

    OpenAI. Introducing gpt-5. 2025. URL https://openai.com/index/introducing-gpt-5/

  4. [4]

    Claude sonnet 4.5 model card

    Anthropic. Claude sonnet 4.5 model card. 2025. URL https://www.anthropic.com/news/claude-sonnet-4-5

  5. [5]

    Longcat-flash-thinking-2601 technical report

    Meituan LongCat Team. Longcat-flash-thinking-2601 technical report. CoRR, abs/2601.16725, 2026

  6. [6]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    Aixin Liu et al. Deepseek-v3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025

  7. [7]

    Qwen3-max model card

    Qwen Team. Qwen3-max model card. 2025. URL https://qwen.ai/blog?id=qwen3-max

  8. [8]

    A survey of personalized large language models: Progress and future directions

    Jiahong Liu, Zexuan Qiu, Zhongyang Li, Quanyu Dai, Wenhao Yu, Jieming Zhu, Minda Hu, Menglin Yang, Tat-Seng Chua, and Irwin King. A survey of personalized large language models: Progress and future directions. arXiv preprint arXiv:2502.11528, 2025

  9. [9]

    PersonaMem-v2: Towards personalized intelligence via learning implicit user personas and agentic memory

    Bowen Jiang, Yuan Yuan, Maohao Shen, Zhuoqun Hao, Zhangchen Xu, Zichen Chen, et al. PersonaMem-v2: Towards personalized intelligence via learning implicit user personas and agentic memory. arXiv preprint arXiv:2512.06688, 2025

  10. [10]

    Know me, respond to me: Benchmarking LLMs for dynamic user profiling and personalized responses at scale

    Bowen Jiang, Zhuoqun Hao, Young-Min Cho, Bryan Li, Yuan Yuan, Sihao Chen, Lyle Ungar, Camillo J Taylor, and Dan Roth. Know me, respond to me: Benchmarking LLMs for dynamic user profiling and personalized responses at scale. arXiv preprint arXiv:2504.14225, 2025

  11. [11]

    KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation

    Tongbo Chen, Zhengxi Lu, Zhan Xu, Guocheng Shao, Shaohan Zhao, Fei Tang, Yong Du, Kaitao Song, Yizhou Liu, Yuchen Yan, et al. Knowu-bench: Towards interactive, proactive, and personalized mobile agent evaluation. arXiv preprint arXiv:2604.08455, 2026

  12. [12]

    SWE-bench: Can language models resolve real-world GitHub issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations, 2024

  13. [13]

    AgentBench: Evaluating LLMs as agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, et al. AgentBench: Evaluating LLMs as agents. In International Conference on Learning Representations, 2024

  14. [14]

    WebArena: A realistic web environment for building autonomous agents

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, et al. WebArena: A realistic web environment for building autonomous agents. In International Conference on Learning Representations, 2024

  15. [15]

    τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

    Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024

  16. [16]

    τ²-bench: Evaluating conversational agents in a dual-control environment

    Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ²-bench: Evaluating conversational agents in a dual-control environment. arXiv preprint arXiv:2506.07982, 2025

  17. [17]

    VitaBench: Benchmarking LLM agents with versatile interactive tasks in real-world applications

    Wei He, Yueqing Sun, Hongyan Hao, Xueyuan Hao, Zhikang Xia, Qi Gu, Chengcheng Han, et al. VitaBench: Benchmarking LLM agents with versatile interactive tasks in real-world applications. arXiv preprint arXiv:2509.26490, 2025

  18. [18]

    Personalization of large language models: A survey

    Zhehao Zhang, Ryan Lutz, Aidan Mao, Tianyue Bao, Zijian Wang, Zhoujian Zhao, Kaixin Xiang, Liwei Ding, Le Tong, Jiaxin Zhuo, et al. Personalization of large language models: A survey. arXiv preprint arXiv:2411.00027, 2024

  19. [19]

    Mem0: The memory layer for personalized AI

    Mem0. Mem0: The memory layer for personalized AI. https://mem0.ai, 2024

  20. [22]

    Two tales of persona in LLMs: A survey of role-playing and personalization

    Yu-Min Tseng, Yu-Chao Huang, Teng-Yun Hsiao, Wei-Lin Huang, et al. Two tales of persona in LLMs: A survey of role-playing and personalization. arXiv preprint arXiv:2406.01171, 2024

  21. [23]

    Optimization methods for personalizing large language models through retrieval augmentation

    Alireza Salemi, Sheshera Mysore, Michael Bendersky, and Hamed Zamani. Optimization methods for personalizing large language models through retrieval augmentation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024

  22. [24]

    PEARL: Personalizing large language model writing assistants with generation-calibrated retrievers

    Sheshera Mysore, Zhuoran Lu, Mengting Wan, Julian McAuley, and Hamed Zamani. PEARL: Personalizing large language model writing assistants with generation-calibrated retrievers. In Proceedings of the 1st Workshop on Customizable NLP, 2024

  23. [25]

    Integrating summarization and retrieval for enhanced personalization via large language models

    Jesse Richardson, Kristen Bloom, Aggeliki Founta, and Brendan Mathew. Integrating summarization and retrieval for enhanced personalization via large language models. arXiv preprint arXiv:2310.20081, 2023

  24. [26]

    Teach LLMs to personalize–an approach inspired by writing education

    Cheng Li, Mingyang Chen, Haoping Wang, Bin Zhu, Haoyu Luo, et al. Teach LLMs to personalize–an approach inspired by writing education. arXiv preprint arXiv:2308.07968, 2023

  25. [27]

    Understanding the role of user profile in the personalization of large language models

    Ostap Wu, Max Haim, Tanmay Dey, et al. Understanding the role of user profile in the personalization of large language models. arXiv preprint arXiv:2406.17803, 2024

  26. [28]

    Democratizing large language models via personalized parameter-efficient fine-tuning

    Zhaoxuan Tan, Qingkai Zeng, Yijun Tian, Zheyuan Liu, Bing Yin, and Meng Jiang. Democratizing large language models via personalized parameter-efficient fine-tuning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

  27. [29]

    PLoRA: Personalized low-rank adaptation for human-centered text understanding

    Yuting Zhang, Yuliang Ding, et al. PLoRA: Personalized low-rank adaptation for human-centered text understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024

  28. [30]

    HYDRA: Model factorization framework for black-box LLM personalization

    Tao Zhuang, Xin Wang, Zhirui Yuan, et al. HYDRA: Model factorization framework for black-box LLM personalization. arXiv preprint arXiv:2406.02888, 2024

  29. [31]

    Personalized soups: Personalized large language model alignment via post-hoc parameter merging

    Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Shafran, Yejin Choi, et al. Personalized soups: Personalized large language model alignment via post-hoc parameter merging. arXiv preprint arXiv:2310.11564, 2023

  30. [32]

    Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization

    Zhanhui Zhou, Jie Liu, Jing Dong, Jiaheng Yang, et al. Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization. arXiv preprint arXiv:2310.03708, 2023

  31. [33]

    NextQuill: Causal preference modeling for enhancing LLM personalization

    Xiaoyan Zhao, Juntao You, Yang Zhang, Wenjie Wang, Hong Cheng, Fuli Feng, See-Kiong Ng, and Tat-Seng Chua. NextQuill: Causal preference modeling for enhancing LLM personalization. arXiv preprint arXiv:2506.02368, 2025

  32. [34]

    MemGPT: Towards LLMs as Operating Systems

    Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G Patil, Ion Stoica, and Joseph E Gonzalez. MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560, 2023

  33. [35]

    A-MEM: Agentic Memory for LLM Agents

    Wujiang Xu, Zujie Liang, Kai Mei, et al. A-MEM: Agentic memory for LLM agents. arXiv preprint arXiv:2502.12110, 2025

  34. [36]

    PersonalLLM: Tailoring LLMs to individual preferences

    Thomas P Zollo, Andrew Weidinger, et al. PersonalLLM: Tailoring LLMs to individual preferences. arXiv preprint arXiv:2409.20296, 2024

  35. [37]

    Do LLMs recognize your preferences? evaluating personalized preference following in LLMs

    Xiaoyan Zhao, Yang Zhang, Juntao You, Wenjie Wang, Fuli Feng, et al. Do LLMs recognize your preferences? evaluating personalized preference following in LLMs. In International Conference on Learning Representations, 2025

  36. [38]

    PersonaBench: Evaluating AI models on understanding personal information through accessing (synthetic) private user data

    Zhaoxuan Tan et al. PersonaBench: Evaluating AI models on understanding personal information through accessing (synthetic) private user data. In International Conference on Learning Representations, 2025

  37. [39]

    LaMP: When large language models meet personalization

    Alireza Salemi, Sheshera Mysore, Michael Bendersky, and Hamed Zamani. LaMP: When large language models meet personalization. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024

  38. [40]

    LongLaMP: A benchmark for personalized long-form text generation

    Ishita Kumar, Snigdha Viswanathan, et al. LongLaMP: A benchmark for personalized long-form text generation. arXiv preprint arXiv:2407.11016, 2024

  39. [41]

    Evaluating very long-term conversational memory of LLM agents

    Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024

  40. [42]

    LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

    Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Wu, Kai Yu, et al. LongMemEval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813, 2024

  41. [43]

    MemSim: A Bayesian simulator for evaluating memory of personal assistants

    Zeyu Zhang et al. MemSim: A Bayesian simulator for evaluating memory of personal assistants. arXiv preprint arXiv:2409.20163, 2024

  42. [44]

    AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment

    Jianfei Xiao, Xiang Yu, Chengbing Wang, Wuqiang Zheng, Xinyu Lin, Kaining Liu, Hongxun Ding, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. AlpsBench: An LLM personalization benchmark for real-dialogue memorization and preference alignment. arXiv preprint arXiv:2603.26680, 2026

  43. [45]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems, 2023

  44. [46]

    API-Bank: A comprehensive benchmark for tool-augmented LLMs

    Minghao Li, Feifan Song, Bowen Yu, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. API-Bank: A comprehensive benchmark for tool-augmented LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

  45. [47]

    Gorilla: Large language model connected with massive APIs

    Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive APIs. In Advances in Neural Information Processing Systems, 2024

  46. [48]

    ToolTalk: Evaluating tool-usage in a conversational setting

    Nicholas Farn and Richard Shin. ToolTalk: Evaluating tool-usage in a conversational setting. arXiv preprint arXiv:2311.10775, 2023

  47. [49]

    MINT: Evaluating LLMs in multi-turn interaction with tools and language feedback

    Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji. MINT: Evaluating LLMs in multi-turn interaction with tools and language feedback. In International Conference on Learning Representations, 2024

  48. [50]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, et al. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv preprint arXiv:2307.16789, 2023

  49. [51]

    ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities

    Jiarui Lu, Thomas Zhu, Hao Jiang, Marta Skreta, Arun Sai Rawat, et al. ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities. arXiv preprint arXiv:2408.04682, 2024

  50. [52]

    AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation

    Wentao Shi, Yu Wang, Yuyang Zhao, Yuxin Chen, Fuli Feng, Xueyuan Hao, Xi Su, Qi Gu, Hui Su, Xunliang Cai, et al. Aj-bench: Benchmarking agent-as-a-judge for environment-aware evaluation. arXiv preprint arXiv:2604.18240, 2026

  51. [53]

    OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, et al. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In Advances in Neural Information Processing Systems, 2024

  52. [54]

    AppWorld: A controllable world of apps and people for benchmarking interactive coding agents

    Harsh Trivedi, Tushar Khot, Mareike Hartmann, Reshef Manber, Vinty Baber, David Fishi, et al. AppWorld: A controllable world of apps and people for benchmarking interactive coding agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024

  53. [55]

    SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

    Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, et al. SkillsBench: Benchmarking how well agent skills work across diverse tasks. arXiv preprint arXiv:2602.12670, 2026

  54. [56]

    Agentnoisebench: Benchmarking robustness of tool-using llm agents under noisy condition

    Ruipeng Wang, Yuxin Chen, Yukai Wang, Chang Wu, Junfeng Fang, Xiaodong Cai, Qi Gu, Hui Su, An Zhang, Xiang Wang, et al. Agentnoisebench: Benchmarking robustness of tool-using llm agents under noisy condition. arXiv preprint arXiv:2602.11348, 2026

  55. [57]

    Risky-bench: Probing agentic safety risks under real-world deployment

    Jingnan Zheng, Yanzhen Luo, Jingjun Xu, Bingnan Liu, Yuxin Chen, Chenhang Cui, Gelei Deng, Chaochao Lu, Xiang Wang, An Zhang, et al. Risky-bench: Probing agentic safety risks under real-world deployment. arXiv preprint arXiv:2602.03100, 2026

  56. [58]

    Introducing gpt-4.1 in the api

    OpenAI. Introducing gpt-4.1 in the api. 2025. URL https://openai.com/index/gpt-4-1/

  57. [59]

    Introducing gpt-5.1

    OpenAI. Introducing gpt-5.1. 2025. URL https://openai.com/index/gpt-5-1/

  58. [60]

    Introducing gpt-5.2

    OpenAI. Introducing gpt-5.2. 2025. URL https://openai.com/index/introducing-gpt-5-2/

  59. [61]

    Introducing o3 and o4-mini

    OpenAI. Introducing o3 and o4-mini. 2025. URL https://openai.com/index/introducing-o3-and-o4-mini/

  60. [62]

    Deepseek-v4 model card

    DeepSeekAI. Deepseek-v4 model card. 2026. URL huggingface.co/deepseek-ai/DeepSeek-V4-Pro

  61. [63]

    Claude sonnet 4 system card

    Anthropic. Claude sonnet 4 system card. 2025. URL https://www.anthropic.com/news/claude-4

  62. [64]

    Claude opus 4.6 system card

    Anthropic. Claude opus 4.6 system card. 2026. URL https://www.anthropic.com/claude-opus-4-6-system-card

  63. [65]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici et al. Gemini 2.5: Advanced reasoning, multimodality, and agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

  64. [66]

    Gemini 2.5 pro model card

    Google. Gemini 2.5 pro model card. 2025. URL https://modelcards.withgoogle.com/assets/documents/gemini-2.5-pro.pdf

  65. [67]

    Gemini 2.5 flash model card

    Google. Gemini 2.5 flash model card. 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Flash-Model-Card.pdf

  66. [68]

    GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    Aohan Zeng et al. Glm-4.5: Agentic, reasoning, and coding foundation models. arXiv preprint arXiv:2508.06471, 2025

  67. [69]

    Glm-4.6 technical blog

    Z.ai. Glm-4.6 technical blog. 2025. URL https://z.ai/blog/glm-4.6

  68. [70]

    GLM-5.1 model card

    Z.ai. GLM-5.1 model card. 2026. URL https://huggingface.co/zai-org/GLM-5.1

  69. [71]

    Seed 1.6 technical introduction

    ByteDance. Seed 1.6 technical introduction. 2025. URL https://seed.bytedance.com/en/seed1_6

  70. [72]

    Seed 2.0 model card: Towards intelligence frontier for real-world complexity

    ByteDance Seed. Seed 2.0 model card: Towards intelligence frontier for real-world complexity

  71. [73]

    URL seed.bytedance.com/en/seed2

  72. [74]

    Kimi-K2.6 model card

    Moonshot AI. Kimi-K2.6 model card. 2026. URL https://huggingface.co/moonshotai/Kimi-K2.6

  73. [75]

    Longcat-flash technical report

    Meituan LongCat Team. Longcat-flash technical report. arXiv preprint arXiv:2509.01322, 2025

  74. [76]

    MiniMax-M2.7: Model self-improvement, driving productivity innovation through technological breakthroughs

    MiniMax. MiniMax-M2.7: Model self-improvement, driving productivity innovation through technological breakthroughs. 2026. URL https://www.minimax.io/models/text/m27

  75. [77]

    Efficient streaming language models with attention sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In International Conference on Learning Representations (ICLR), 2024

  76. [78]

    Augmenting language models with long-term memory

    Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei. Augmenting language models with long-term memory. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  77. [79]

    Recursively summarizing enables long-term dialogue memory in large language models

    Qingyue Wang, Liang Ding, Yanan Cao, Zhiliang Tian, Shi Wang, Dacheng Tao, and Li Guo. Recursively summarizing enables long-term dialogue memory in large language models. arXiv preprint arXiv:2308.15022, 2023

  78. [80]

    Chain of agents: Large language models collaborating on long-context tasks

    Yusen Zhang, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, and Sercan O. Arik. Chain of agents: Large language models collaborating on long-context tasks. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  79. [81]

    ReSum: Unlocking long-horizon search intelligence via context summarization

    Xixi Wu, Kuan Li, Yida Zhao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou, et al. ReSum: Unlocking long-horizon search intelligence via context summarization. arXiv preprint arXiv:2509.13313, 2025

  80. [82]

    Scaling long-horizon LLM agent via context-folding

    Weiwei Sun, Miao Lu, Zhan Ling, et al. Scaling long-horizon LLM agent via context-folding. arXiv preprint arXiv:2510.11967, 2025

Showing first 80 references.