pith. machine review for the scientific record.

arxiv: 2604.27043 · v1 · submitted 2026-04-29 · 💻 cs.CL

Recognition: unknown

CL-bench Life: Can Language Models Learn from Real-Life Context?

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 09:38 UTC · model grok-4.3

classification 💻 cs.CL
keywords: language models · context learning · real-life contexts · benchmark evaluation · AI assistants · personal data · frontier models · task solving

The pith

Frontier language models cannot reliably learn from messy real-life contexts: even the best of ten evaluated models solves only 19.3 percent of tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CL-bench Life, a human-curated benchmark of 405 context-task pairs drawn from everyday scenarios such as group chats and personal records, to test whether current language models can extract useful knowledge from fragmented, personal contexts. Evaluation of ten frontier models reveals low performance, with the top model reaching just 19.3 percent task success and the average across models at 13.8 percent. This matters because AI assistants are shifting from professional tools to everyday companions that must handle personal and social data. The benchmark uses 5,348 verification rubrics to measure reasoning over contexts that existing tests do not cover. If the results hold, progress toward reliable personal AI will require new methods beyond current scaling approaches.

Core claim

Real-life context learning remains highly challenging for current frontier language models. Even the best-performing model achieves only 19.3 percent task solving rate on CL-bench Life, while the average across ten models is 13.8 percent. Models struggle particularly with reasoning over messy group chat histories and fragmented behavioral records from everyday life. The benchmark consists of 405 human-curated context-task pairs and 5,348 rubrics designed to require strong abilities to learn from complex, personal, and fragmented real-life contexts that go beyond existing professional or synthetic benchmarks.

What carries the argument

CL-bench Life benchmark, a set of 405 human-curated context-task pairs and 5,348 verification rubrics covering real-life scenarios such as multi-party conversations, personal archives, and behavioral traces.
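To make the instance format concrete, here is a minimal Python sketch of one context-task pair and its rubric-based scoring. The `Rubric` and `Instance` types, the field names, and the threshold rule are illustrative assumptions, not the paper's released schema; the paper specifies only that each instance pairs a context and a task with a set of verification rubrics.

```python
from dataclasses import dataclass, field

@dataclass
class Rubric:
    # One verification criterion; whether it is satisfied would come from
    # a grader (human or LLM judge). Hypothetical schema, not the paper's
    # released format.
    description: str
    satisfied: bool = False

@dataclass
class Instance:
    # A CL-bench Life item: a real-life context, a task grounded in it,
    # and the rubrics used to verify a model's answer.
    context: str   # e.g., a messy group chat history
    task: str      # e.g., "state the final dinner date and venue"
    rubrics: list[Rubric] = field(default_factory=list)

def task_solved(rubrics: list[Rubric], threshold: float = 1.0) -> bool:
    # A task counts as solved when the fraction of satisfied rubrics meets
    # the threshold. Figure 9 varies this threshold; the exact rule behind
    # the headline 19.3% number is an assumption here.
    return sum(r.satisfied for r in rubrics) / len(rubrics) >= threshold

def solving_rate(instances: list[Instance], threshold: float = 1.0) -> float:
    # The reported metric: the fraction of the 405 tasks a model solves.
    solved = sum(task_solved(i.rubrics, threshold) for i in instances)
    return solved / len(instances)
```

With 5,348 rubrics over 405 instances, this works out to roughly 13 verification criteria per task, which is what makes per-task scoring reproducible.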

If this is right

  • Frontier models require substantial advances before they can serve as reliable everyday personal assistants.
  • Progress on CL-bench Life would directly improve AI handling of personal conversations and records.
  • Current evaluation methods that use clean or professional contexts are insufficient for real-life applications.
  • New training techniques focused on fragmented and multi-party data are needed to close the performance gap.
  • The benchmark serves as a testbed that can guide development of more capable life-context models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Improving performance here may require explicit handling of privacy constraints when models process personal archives.
  • The gap suggests that simply increasing model size may not close the difference between professional and personal context handling.
  • The benchmark could be extended with dynamic, ongoing contexts to better simulate real assistant usage over time.
  • Low scores indicate that models may need separate modules for integrating behavioral traces rather than treating all context as text alone.

Load-bearing premise

That the 405 context-task pairs and 5,348 rubrics faithfully represent the core difficulties of real-life context learning, and that success on them predicts performance in actual personal use.

What would settle it

A new model scoring above 70 percent on the benchmark yet still failing personal tasks in uncontrolled real-world settings, or one scoring below 10 percent that nonetheless succeeds there; either outcome would break the link between benchmark success and real-life performance.

Figures

Figures reproduced from arXiv: 2604.27043 by Changze Lv, Cheng Zhang, Chenhao Huang, Di Wang, Huaibing Xie, Jiahang Lin, Jianglu Hu, Jiayi Chen, Jiazheng Zhang, Jingcheng Deng, Junjie Ye, Junzhe Wang, Lincheng Li, Minda Hu, Ming Zhang, Pluto Zhou, Qianyu He, Qi Zhang, Shaofan Liu, Shaolei Wang, Shichun Liu, Shihan Dou, Shunyu Yao, Syrus Zhao, Tao Gui, Tao Ji, Weichao Wang, Xuanjing Huang, Yang Liu, Yan Lei, Yanling Xiao, Yiting Liu, Yu-Gang Jiang, Yujiong Shen, Zenan Xu, Zhangyue Yin, Zhen Guo, Ziliang Zhao.

Figure 1: Three simplified examples of real-life contexts and tasks.
Figure 2: Each instance in CL-bench Life consists of a task, a context, and a set of verification rubrics.
Figure 3: Three cases from CL-bench Life. Case 1: a long, noisy group chat about planning a November “fancy” book club dinner; the task requires tracking changes in date and venue, reconciling conflicting member constraints, inferring final attendance from scattered updates, and judging whether the final plan satisfies the group’s stated preferences. Case 2: fragmented personal cycling archives, including trip plans…
Figure 4: Each subcategory contains an equal number of contexts to avoid evaluation bias toward…
Figure 5: Performance comparison between reasoning and non-reasoning settings.
Figure 6: Task solving rate across context length bins under reasoning and non-reasoning settings.
Figure 7: Average reasoning tokens (left) and solution tokens (right) versus task solving rate. Arrows connect reasoning and non-reasoning variants of the same model. More reasoning tokens generally correlate with a higher solving rate; longer answers do not reliably predict one.
Figure 8: Fine-grained breakdown of failures in the group conversations & meeting transcripts category. Reasoning quality matters more than reasoning volume.
Figure 9: Task pass rate under different thresholds for all ten models.
Figure 10: Model performance under Pass@k and Pass^k.
Figure 11: An example from CL-bench Life where the task requires identifying the largest recurring…
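Figures 9 and 10 report pass rates under rubric-satisfaction thresholds and under Pass@k versus Pass^k. The page does not reproduce the paper's estimators, so the sketch below uses the standard unbiased estimators from the code-generation and agent-evaluation literature; the authors' implementation may differ.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased Pass@k: probability that at least one of k samples, drawn
    # without replacement from n attempts of which c are correct, solves
    # the task. Assumes k <= n.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    # Pass^k: probability that all k sampled attempts solve the task,
    # the reliability-oriented counterpart of Pass@k.
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# Example: a task solved in 4 of 10 attempts.
print(pass_at_k(10, 4, 3))   # chance at least one of 3 tries succeeds (~0.83)
print(pass_hat_k(10, 4, 3))  # chance all 3 tries succeed (~0.03)
```

Pass^k falls much faster than Pass@k as k grows, which is why the two curves in Figure 10 diverge for models with inconsistent per-task behavior.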
Original abstract

Today's AI assistants such as OpenClaw are designed to handle context effectively, making context learning an increasingly important capability for models. As these systems move beyond professional settings into everyday life, the nature of the contexts they must handle also shifts. Real-life contexts are often messy, fragmented, and deeply tied to personal and social experience, such as multi-party conversations, personal archives, and behavioral traces. Yet it remains unclear whether current frontier language models can reliably learn from such contexts and solve tasks grounded in them. To this end, we introduce CL-bench Life, a fully human-curated benchmark comprising 405 context-task pairs and 5,348 verification rubrics, covering common real-life scenarios. Solving tasks in CL-bench Life requires models to reason over complex, messy real-life contexts, calling for strong real-life context learning abilities that go far beyond those evaluated in existing benchmarks. We evaluate ten frontier LMs and find that real-life context learning remains highly challenging: even the best-performing model achieves only 19.3% task solving rate, while the average performance across models is only 13.8%. Models still struggle to reason over contexts such as messy group chat histories and fragmented behavioral records from everyday life. CL-bench Life provides a crucial testbed for advancing real-life context learning, and progress on it can enable more intelligent and reliable AI assistants in everyday life.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces CL-bench Life, a human-curated benchmark of 405 context-task pairs and 5,348 verification rubrics targeting real-life scenarios such as multi-party chats and personal behavioral traces. It evaluates ten frontier language models, reports a maximum task-solving rate of 19.3% and an average of 13.8%, and concludes that real-life context learning remains highly challenging for current models.

Significance. If the benchmark construction and evaluation protocol are shown to be reliable, the work would provide a useful testbed for measuring progress on context learning outside professional settings. The scale of the rubric set is a positive feature for reproducible scoring.

major comments (2)
  1. [Abstract] The claim that the reported scores demonstrate inherent difficulty in real-life context learning is not supported by the provided evidence. No human performance baseline on the same 405 pairs using the 5,348 rubrics is reported, nor are inter-annotator agreement statistics or details on how task difficulty and representativeness were controlled during curation. These omissions are load-bearing for interpreting low model scores as model limitations rather than artifacts of benchmark design.
  2. [Benchmark construction and evaluation sections] The manuscript states that the contexts are 'messy, fragmented, and deeply tied to personal experience' but provides no quantitative validation (e.g., distribution statistics or expert review) that the 405 pairs match actual personal-use distributions. Without such grounding, the generalizability of the 13.8% average result to 'everyday life' remains unestablished.
minor comments (1)
  1. [Results] The abstract and results sections would benefit from an explicit table listing per-model scores and error breakdowns to allow readers to identify which context types drive the low aggregate numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of benchmark validation that will strengthen the interpretation of our results. We address each major comment below and commit to revisions that incorporate the suggested improvements.

Point-by-point responses
  1. Referee: [Abstract] The claim that the reported scores demonstrate inherent difficulty in real-life context learning is not supported by the provided evidence. No human performance baseline on the same 405 pairs using the 5,348 rubrics is reported, nor are inter-annotator agreement statistics or details on how task difficulty and representativeness were controlled during curation. These omissions are load-bearing for interpreting low model scores as model limitations rather than artifacts of benchmark design.

    Authors: We agree that a human performance baseline would provide essential context for interpreting the low model scores as evidence of inherent difficulty rather than benchmark artifacts. We will collect and report human performance results on the full set of 405 tasks using the 5,348 rubrics in the revised manuscript. We will also add inter-annotator agreement statistics for the rubric creation process and expand the curation section with explicit details on how task difficulty was controlled (e.g., expert review for solvability) and how representativeness was ensured during selection of the 405 pairs. revision: yes

  2. Referee: [Benchmark construction and evaluation sections] The manuscript states that the contexts are 'messy, fragmented, and deeply tied to personal experience' but provides no quantitative validation (e.g., distribution statistics or expert review) that the 405 pairs match actual personal-use distributions. Without such grounding, the generalizability of the 13.8% average result to 'everyday life' remains unestablished.

    Authors: We acknowledge the need for quantitative grounding to support claims of representativeness. Although the contexts were selected and verified by human experts drawing from real personal and social scenarios, the current manuscript lacks explicit distribution statistics. In the revision, we will add quantitative validation including context length distributions, participant counts, fragmentation metrics (e.g., message discontinuity rates), and summaries of expert review processes confirming alignment with everyday personal-use patterns. This will better establish the generalizability of the 13.8% average performance to real-life contexts. revision: yes
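As an illustration of the fragmentation statistics the rebuttal proposes, here is a minimal sketch. The rebuttal names "message discontinuity rates" without defining them, so the adjacent-topic-switch definition, the `topic` and `sender` fields, and the function names below are all hypothetical.

```python
def discontinuity_rate(messages: list[dict]) -> float:
    # Fraction of adjacent message pairs whose topic label changes.
    # `messages` is a hypothetical [{"sender": ..., "topic": ...}, ...] log;
    # the adjacent-switch definition is an illustrative assumption.
    if len(messages) < 2:
        return 0.0
    switches = sum(a["topic"] != b["topic"]
                   for a, b in zip(messages, messages[1:]))
    return switches / (len(messages) - 1)

def participant_count(messages: list[dict]) -> int:
    # Distinct senders, one of the context statistics the authors
    # commit to reporting in the revision.
    return len({m["sender"] for m in messages})

chat = [
    {"sender": "ana", "topic": "venue"},
    {"sender": "bo",  "topic": "date"},
    {"sender": "ana", "topic": "date"},
    {"sender": "cy",  "topic": "venue"},
]
print(discontinuity_rate(chat))  # 2/3 of adjacent pairs switch topic
print(participant_count(chat))   # 3
```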

Circularity Check

0 steps flagged

No circularity: direct empirical evaluation on new benchmark

Full rationale

The paper introduces CL-bench Life as a human-curated benchmark of 405 context-task pairs and 5,348 rubrics, then reports straightforward model evaluation results (e.g., 19.3% best task-solving rate). No equations, derivations, fitted parameters, or predictions appear in the provided text. The central claim rests on direct performance measurements rather than any reduction of outputs to inputs by construction, self-citation chains, or ansatz smuggling. This is a standard empirical benchmark paper whose findings are independent of the inputs they describe.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the domain assumption that expert human curation can produce representative and reliably scorable examples of real-life context challenges; no free parameters or invented entities are introduced.

axioms (1)
  • Domain assumption: Human curation produces reliable and representative tasks and rubrics for real-life contexts.
    The benchmark is built entirely from 405 human-curated pairs and 5,348 verification rubrics.

pith-pipeline@v0.9.0 · 5685 in / 1232 out tokens · 59404 ms · 2026-05-07T09:38:32.199538+00:00 · methodology

