Interactive Evaluation Requires a Design Science

Adrian Weller; Jiaxin Pei; Jiaxuan You; Keyang Xuan; Manling Li; Pan Lu; Peiyang Song; Pengrui Han; Wenkai Li; Wenyue Hua

arxiv: 2605.17829 · v1 · pith:V3OHEIA5new · submitted 2026-05-18 · 💻 cs.AI

Keyang Xuan, Peiyang Song, Pan Lu, Pengrui Han, Wenkai Li, Zhenyu Zhang, Zexue He, Wenyue Hua, Manling Li, Jiaxuan You, Adrian Weller, Yizhong Wang, Jiaxin Pei

Pith reviewed 2026-05-20 10:52 UTC · model grok-4.3

classification 💻 cs.AI

keywords: interactive evaluation · AI benchmarks · LLM agents · evaluation paradigms · trajectories · robustness · recoverability · design principles

The pith

AI evaluation for interactive agents requires a new design science that maps trajectories to judgments of process and system performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that current evaluation practices, built for fixed inputs and isolated outputs, fail to handle AI systems that act over time through tools, environments, and other agents. It redefines evaluation as an autonomous mapping from evidence to judgments, then shows how interactive settings change both sides: evidence becomes full interaction trajectories, and judgments must cover process, recoverability, coordination, robustness, and overall system behavior. To replace the resulting fragmented collection of benchmarks, the authors introduce a two-axis taxonomy, design principles, and reporting standards. A sympathetic reader would care because without this shift, claims made by interactive benchmarks remain incomparable and difficult to translate into real deployment decisions.

Core claim

Evaluation is an autonomous mapping from evidence to judgments. Interactive evaluation changes both sides of this mapping: the evidence becomes interaction-generated trajectories, while the evaluation procedure must assess process, recoverability, coordination, robustness, and system-level performance. Building on this definition, a two-axis taxonomy organizes the space, design principles and reporting standards are derived, representative scenarios are examined, and longstanding evaluation challenges are shown to reappear at the trajectory level.
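The mapping view can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the paper's formalism: the `Step` record, the per-step `ok` flag, and the specific formulas for process and recoverability are hypothetical choices made here only to show how the evidence side (a single output versus a full trajectory) and the judgment side (one score versus several) both change.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    """One interaction step (hypothetical schema, not from the paper)."""
    observation: str  # what the system saw: tool output, user turn, environment state
    action: str       # what the system did in response
    ok: bool          # assumed per-step flag: did the action succeed?


Trajectory = List[Step]
Judgment = Dict[str, float]


def response_centered_eval(final_output: str, reference: str) -> Judgment:
    """Evidence is a single output; the judgment is a single outcome score."""
    return {"outcome": float(final_output.strip() == reference.strip())}


def interactive_eval(trajectory: Trajectory, reference: str) -> Judgment:
    """Evidence is the whole trajectory; the judgment has several dimensions."""
    if not trajectory:
        return {"outcome": 0.0, "process": 0.0, "recoverability": 0.0}
    # Outcome: compare the final action with the reference answer.
    outcome = float(trajectory[-1].action.strip() == reference.strip())
    # Process (assumed proxy): fraction of steps flagged as successful.
    process = sum(s.ok for s in trajectory) / len(trajectory)
    # Recoverability (assumed proxy): among failed steps that have a successor,
    # the fraction whose next step succeeded. A failed final step is not counted.
    failures = [i for i, s in enumerate(trajectory[:-1]) if not s.ok]
    recovered = sum(trajectory[i + 1].ok for i in failures)
    recoverability = recovered / len(failures) if failures else 1.0
    return {"outcome": outcome, "process": process, "recoverability": recoverability}


# A made-up four-step trajectory with one failed tool call that is then repaired.
example = [
    Step("task statement", "call_tool", True),
    Step("tool result", "call_tool_bad_args", False),
    Step("error message", "call_tool_fixed_args", True),
    Step("tool result", "42", True),
]
scores = interactive_eval(example, "42")
```

On this example the response-centered evaluator sees only the final "42" and reports a perfect outcome, while the trajectory-level evaluator also records that one of four steps failed and that the failure was followed by a successful step. Coordination, robustness, and system-level performance are omitted because they would need evidence beyond a single trajectory.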

What carries the argument

The redefinition of evaluation as an autonomous mapping from evidence to judgments, which structures the two-axis taxonomy for classifying interactive benchmarks and deriving consistent procedures.

Load-bearing premise

Simply adopting previous evaluation paradigms does not suffice for interactive settings, so a new principled design science is required to organize the fragmented landscape of benchmarks.

What would settle it

A demonstration that existing single-response evaluation methods can fully capture, score, and compare interactive agent systems without new principles for trajectories or process assessment would falsify the central claim.

Original abstract

AI evaluation is undergoing a structural change. Large language models (LLMs) are increasingly deployed as systems that act over time through tools, environments, users, and other agents, while many evaluation practices still inherit assumptions from response-centered benchmarks (e.g., fixed inputs, isolated outputs, and outcome judgments that can be made from a single response). The field has begun to build interactive benchmarks, but the resulting landscape is fragmented: benchmarks differ in what interaction artifacts they admit, how trajectories are scored, and what claims their results support. This position paper argues that interactive evaluation should be treated as a principled evaluation paradigm, not merely a new family of agent benchmarks. Simply adopting previous evaluation paradigms does not suffice. We define evaluation as an autonomous mapping from evidence to judgments, and show that interactive evaluation changes both sides of this mapping: the evidence becomes interaction-generated trajectories, while the evaluation procedure must assess process, recoverability, coordination, robustness, and system-level performance. Building on this definition, we propose a two-axis taxonomy, derive design principles and reporting standards, examine representative scenarios, and analyze how longstanding evaluation challenges reappear at the trajectory level.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. This position paper claims that interactive evaluation of AI systems (e.g., LLMs acting over time via tools and environments) constitutes a distinct paradigm requiring a principled design science, rather than adaptations of response-centered benchmarks. It defines evaluation as an autonomous mapping from evidence to judgments, shows that interactive settings replace fixed inputs with trajectories and require assessing process, recoverability, coordination, robustness, and system-level performance, and proposes a two-axis taxonomy, design principles, reporting standards, representative scenarios, and re-examination of longstanding challenges at the trajectory level.

Significance. If the argument holds, the paper provides a coherent conceptual reorganization of a fragmented area, with the definitional grounding and logical derivation of changed evidence/judgment requirements as clear strengths. It could guide more comparable and principled benchmark development for agentic systems. The contribution is primarily organizational rather than empirical; its influence would increase if the taxonomy demonstrably clarifies existing work.

major comments (1)

The section proposing the two-axis taxonomy: the claim that this taxonomy (and associated design principles) addresses fragmentation is central, yet the manuscript provides no concrete application re-classifying or re-analyzing even two or three existing interactive benchmarks to illustrate how the axes clarify differences in admitted artifacts, scoring, or supported claims.

minor comments (1)

The dimensions of evaluation (process, recoverability, coordination, robustness, system-level performance) are listed without explicit definitions or short illustrative examples in the main text, which reduces clarity for readers new to trajectory-based assessment.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation and the constructive suggestion regarding the taxonomy section. We agree that concrete illustrations will strengthen the paper's ability to demonstrate how the proposed framework organizes the field.

Point-by-point responses

Referee: The section proposing the two-axis taxonomy: the claim that this taxonomy (and associated design principles) addresses fragmentation is central, yet the manuscript provides no concrete application re-classifying or re-analyzing even two or three existing interactive benchmarks to illustrate how the axes clarify differences in admitted artifacts, scoring, or supported claims.

Authors: We accept this observation. The taxonomy is intended to provide a principled lens for comparing interactive evaluations, but without explicit re-applications the claim remains abstract. In the revised version we add a dedicated subsection that applies both axes to re-classify three existing benchmarks (WebArena, ToolBench, and a representative multi-agent coordination environment). For each benchmark we explicitly map the admitted interaction artifacts, the scoring procedures used, and the system-level claims that can be supported, thereby showing how the taxonomy surfaces previously unarticulated differences and reduces fragmentation.

Revision: yes
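The three-part mapping described in the rebuttal (admitted interaction artifacts, scoring procedure, supported claims) could be recorded as a small structured card per benchmark. The sketch below is one possible shape, assuming hypothetical field names and a hypothetical `compare` helper; the two cards use placeholder benchmarks `bench_a` and `bench_b` with invented values and do not describe WebArena, ToolBench, or any real benchmark.

```python
from dataclasses import dataclass, fields
from typing import Dict, FrozenSet, Tuple


@dataclass(frozen=True)
class BenchmarkCard:
    """Hypothetical reporting record; field names are assumptions."""
    name: str
    admitted_artifacts: FrozenSet[str]  # which interaction artifacts count as evidence
    scoring_procedure: str              # how a trajectory is turned into a score
    supported_claims: FrozenSet[str]    # what the results are meant to license


def compare(a: BenchmarkCard, b: BenchmarkCard) -> Dict[str, Tuple[object, object]]:
    """Return the fields (other than name) on which two cards differ."""
    diffs: Dict[str, Tuple[object, object]] = {}
    for f in fields(BenchmarkCard):
        if f.name == "name":
            continue
        va, vb = getattr(a, f.name), getattr(b, f.name)
        if va != vb:
            diffs[f.name] = (va, vb)
    return diffs


# Placeholder cards with invented values, for illustration only.
bench_a = BenchmarkCard(
    name="bench_a",
    admitted_artifacts=frozenset({"tool_calls", "final_state"}),
    scoring_procedure="final-state check",
    supported_claims=frozenset({"task success"}),
)
bench_b = BenchmarkCard(
    name="bench_b",
    admitted_artifacts=frozenset({"tool_calls", "final_state"}),
    scoring_procedure="step-level rubric",
    supported_claims=frozenset({"task success", "process quality"}),
)
differences = compare(bench_a, bench_b)
```

Here the two placeholder cards admit the same artifacts but differ in scoring procedure and supported claims, which is the kind of difference the referee asks the taxonomy to make explicit.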

Circularity Check

0 steps flagged

No significant circularity in definitional position paper

Full rationale

The paper's argument begins with an explicit definition of evaluation as an autonomous mapping from evidence to judgments and logically extends this to interactive settings by observing that evidence shifts to trajectories and that assessment must incorporate process, recoverability, coordination, robustness, and system-level performance. This extension is presented as a direct consequence of the chosen definition rather than a reduction to fitted inputs, self-citations, or prior results. The subsequent proposal of a two-axis taxonomy, design principles, and reporting standards follows as an organizational framework for addressing fragmentation, without any load-bearing step that equates a claimed derivation to its own inputs by construction. No equations, parameter fits, or uniqueness theorems are invoked in a manner that creates circularity; the contribution remains self-contained as a call for principled structure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on a definitional premise about what evaluation is and the assertion that prior paradigms are insufficient; no numerical parameters or new physical entities are introduced.

axioms (1)

Domain assumption: Evaluation is an autonomous mapping from evidence to judgments.
Stated in the abstract as the starting definition that interactive evaluation modifies.

pith-pipeline@v0.9.0 · 5769 in / 1180 out tokens · 28949 ms · 2026-05-20T10:52:05.490709+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · relation: unclear
Paper passage: "We define evaluation as an autonomous mapping from evidence to judgments... interactive evaluation changes both sides of this mapping: the evidence becomes interaction-generated trajectories, while the evaluation procedure must assess process, recoverability, coordination, robustness..."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: "two-axis taxonomy... Axis 1: Evaluation Inputs... Axis 2: Evaluation Programs"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

121 extracted references · 121 canonical work pages · 23 internal anchors

[1]

Sahar Abdelnabi, Amr Gomaa, Sarath Sivaprasad, Lea Schonherr, and Mario Fritz. Cooperation, competition, and maliciousness: Llm-stakeholders interactive negotiation.Advances in Neural Information Processing Systems 37, 2023.https://api.semanticscholar.org/CorpusID:263310628. Melissa Ailem, Katerina Marazopoulou, Charlotte Siska, and James Bono. Examining ...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

How and why llms generalize: A fine-grained analysis of llm reasoning from cognitive behaviors to low-level patterns

Haoyue Bai, Yiyou Sun, Wenjie Hu, Shi Qiu, Maggie Ziyu Huan, Peiyang Song, Robert Nowak, and Dawn Song. How and why llms generalize: A fine-grained analysis of llm reasoning from cognitive behaviors to low-level patterns. arXiv preprint arXiv:2512.24063,

work page arXiv
[3]

Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source llms.ArXiv, abs/2402.03927, 2024.https://api.semanticscholar.org/ CorpusID:267499939

Simone Balloccu, Patrícia Schmidtová, Mateusz Lango, and Ondrej Dusek. Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source llms.ArXiv, abs/2402.03927, 2024.https://api.semanticscholar.org/ CorpusID:267499939. Samuel Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning na...

work page arXiv 2024
[4]

clembench: Using game play to evaluate chat-optimized language models as conversational agents

Kranti Chalamalasetti, Jana Götze, Sherzod Hakimov, Brielen Madureira, Philipp Sadler, and David Schlangen. clembench: Using game play to evaluate chat-optimized language models as conversational agents. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 11174–11219,

work page 2023
[5]

Private benchmarking to prevent contamination and improve comparative evaluation of llms.ArXiv, abs/2403.00393, 2024.https://api.semanticscholar.org/CorpusID:268201479

Nishanth Chandran, Sunayana Sitaram, Divya Gupta, Rahul Sharma, Kashish Mittal, and Manohar Swaminathan. Private benchmarking to prevent contamination and improve comparative evaluation of llms.ArXiv, abs/2403.00393, 2024.https://api.semanticscholar.org/CorpusID:268201479. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto,...

work page arXiv 2024
[6]

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference.arXiv preprint arXiv:2403.04132,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Ai impact on human proof formalization workflows

Katherine M Collins, Simon Frieder, Jonas Bayer, Jacob Loader, Jeck Lim, Peiyang Song, Fabian Zaiser, Lexin Zhou, Shanda Li, Shi-Zhuo Looi, et al. Ai impact on human proof formalization workflows. InThe 5th Workshop on Mathematical Reasoning and AI at NeurIPS 2025,

work page 2025
[9]

Davidson, Veniamin Veselovsky, Martin Josifoski, Maxime Peyrard, Antoine Bosselut, Michal Kosinski, and Robert West

11 Tim R. Davidson, Veniamin Veselovsky, Martin Josifoski, Maxime Peyrard, Antoine Bosselut, Michal Kosinski, and Robert West. Evaluating language model agency through negotiations.ArXiv, abs/2401.04536,

work page arXiv
[10]

WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark B. Gerstein, and Arman Cohan. Investigating data contamination in modern benchmarks for large language models. InNorth American Chapter of the Association for Computational Linguistics, 2023a.https://api.semanticscholar.org/CorpusID:265220695. Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wa...

work page internal anchor Pith review Pith/arXiv arXiv
[11]

DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs, 2019.https://arxiv.org/abs/1903.00161. Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evalu...

work page internal anchor Pith review Pith/arXiv arXiv 2019
[12]

Maseval: Extending multi-agent evaluation from models to systems.arXiv preprint arXiv:2603.08835,

Cornelius Emde, Alexander Rubinstein, Anmol Goel, Ahmed Heakl, Sangdoo Yun, Seong Joon Oh, and Martin Gubri. Maseval: Extending multi-agent evaluation from models to systems.arXiv preprint arXiv:2603.08835,

work page arXiv
[13]

Agentprocessbench: Diagnosing step-level process quality in tool-using agents.arXiv preprint arXiv:2603.14465, 2026

Shengda Fan, Xuyan Ye, Yupeng Huo, Zhi-Yuan Chen, Yiju Guo, Shenzhi Yang, Wenkai Yang, Shuqi Ye, Jingwen Chen, Haotian Chen, Xin Cong, and Yankai Lin. Agentprocessbench: Diagnosing step-level process quality in tool-using agents, 2026.https://arxiv.org/abs/2603.14465. Yukang Feng, Jianwen Sun, Zelai Yang, Jiaxin Ai, Chuanhao Li, Zizhen Li, Fanrui Zhang, K...

work page arXiv 2026
[14]

2509.17158 , archivePrefix=

ARC Prize Foundation. Arc-agi-3: A new challenge for frontier agentic intelligence, 2026.https://arxiv.org/abs/2603. 24621. Romain Froger, Pierre Andrews, Matteo Bettini, Amar Budhiraja, Ricardo Silveira Cabral, Virginie Do, Emilien Garreau, Jean-Baptiste Gaya, Hugo Laurençon, Maxime Lecanu, et al. Are: Scaling up agent environments and evaluations.arXiv ...

work page arXiv 2026
[15]

Omni-math: A universal olympiad level mathematic benchmark for large language models

Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Zhengyang Tang, et al. Omni-math: A universal olympiad level mathematic benchmark for large language models. In International Conference on Learning Representations, volume 2025, pages 100540–100569,

work page 2025
[16]

Leanprogress: Guiding search for neural theorem proving via proof progress prediction.arXiv preprint arXiv:2502.17925,

Robert Joseph George, Suozhi Huang, Peiyang Song, and Anima Anandkumar. Leanprogress: Guiding search for neural theorem proving via proof progress prediction.arXiv preprint arXiv:2502.17925,

work page arXiv
[17]

Builderbench–a benchmark for generalist agents.arXiv preprint arXiv:2510.06288,

Raj Ghugare, Catherine Ji, Kathryn Wantlin, Jin Schofield, and Benjamin Eysenbach. Builderbench–a benchmark for generalist agents.arXiv preprint arXiv:2510.06288,

work page arXiv
[18]

Time travel in llms: Tracing data contamination in large language models.arXiv preprint arXiv:2308.08493, 2023

Shahriar Golchin and Mihai Surdeanu. Time travel in llms: Tracing data contamination in large language models. ArXiv, abs/2308.08493, 2023.https://api.semanticscholar.org/CorpusID:260925501. 12 Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc’Aurelio Ranzato, Francisco Guzmán, and Angela Fan. The ...

work page arXiv 2023
[19]

Stabletoolbench: Towards stable large-scale benchmarking on tool learning of large language models

Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li, Zhiyuan Liu, Maosong Sun, and Yang Liu. Stabletoolbench: Towards stable large-scale benchmarking on tool learning of large language models. InFindings of the Association for Computational Linguistics: ACL 2024, pages 11143–11156,

work page 2024
[20]

In-context learning may not elicit trustworthy reasoning: A-not-b errors in pretrained language models

Pengrui Han, Peiyang Song, Haofei Yu, and Jiaxuan You. In-context learning may not elicit trustworthy reasoning: A-not-b errors in pretrained language models. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 5624–5643,

work page 2024
[21]

The personality illusion: Revealing dissociation between self-reports & behavior in llms.arXiv preprint arXiv:2509.03730,

Pengrui Han, Rafal Kocielnik, Peiyang Song, Ramit Debnath, Dean Mobbs, Anima Anandkumar, and R Michael Alvarez. The personality illusion: Revealing dissociation between self-reports & behavior in llms.arXiv preprint arXiv:2509.03730,

work page arXiv
[22]

Memoryarena: Benchmarking agent memory in interdependent multi-session agentic tasks.arXiv preprint arXiv:2602.16313, 2026

Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen, Lang Yin, Ze Chen, Tong Arthur Wu, Siru Ouyang, Zihan Wang, et al. Memoryarena: Benchmarking agent memory in interdependent multi-session agentic tasks.arXiv preprint arXiv:2602.16313,

work page arXiv
[23]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300,

work page internal anchor Pith review Pith/arXiv arXiv 2009
[24]

Measuring Coding Challenge Competence With APPS

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Xiaodong Song, and Jacob Steinhardt. Measuring coding challenge competence with apps.ArXiv, abs/2105.09938, 2021a.https://api.semanticscholar.org/CorpusID:234790100. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, St...

work page internal anchor Pith review Pith/arXiv arXiv
[25]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Alon Jacovi, Avi Caciularu, Omer Goldman, and Yoav Goldberg. Stop uploading test data in plain text: Practical strategies for mitigating data contamination by evaluation benchmarks. InConference on Empirical Methods in Natural Language Processing, 2023.https://api.semanticscholar.org/CorpusID:258741333. Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Y...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning

Ehud Karpas, Omri Abend, Yonatan Belinkov, Barak Lenz, Opher Lieber, Nir Ratner, Yoav Shoham, Hofit Bata, Yoav Levine, Kevin Leyton-Brown, et al. Mrkl systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning.arXiv preprint arXiv:2205.00445,

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Smith, Yejin Choi, and Kentaro Inui

Jungo Kasai, Keisuke Sakaguchi, Yoichi Takahashi, Ronan Le Bras, Akari Asai, Xinyan Velocity Yu, Dragomir R. Radev, Noah A. Smith, Yejin Choi, and Kentaro Inui. Realtime qa: What’s the answer right now?ArXiv, abs/2207.13332, 2022.https://api.semanticscholar.org/CorpusID:251105205. Arpandeep Khatua, Hao Zhu, Peter Tran, Arya Prabhudesai, Frederic Sadrieh, ...

work page arXiv 2022
[28]

Dynabench: Rethinking benchmarking in NLP

Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Zeerak Talat, Pontus Stenetorp, Robin Jia, Mohit Bansal, Christopher Potts, and Adina Williams. Dynabench: Rethinking benchmarking in nlp.ArXiv, abs/2104.14337, 202...

work page arXiv 2021
[29]

Booksum: A collection of datasets for long-form narrative summarization

Wojciech Kryściński, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, and Dragomir Radev. Booksum: A collection of datasets for long-form narrative summarization. InFindings of the association for computational linguistics: EMNLP 2022, pages 6536–6558,

work page 2022
[30]

Leanagent: Lifelong learning for formal theorem proving

Adarsh Kumarappan, Mohit Tiwari, Peiyang Song, Robert Joseph George, Chaowei Xiao, et al. Leanagent: Lifelong learning for formal theorem proving. InInternational Conference on Learning Representations, volume 2025, pages 73525–73564,

work page 2025
[31]

Evaluating human-language model interaction.arXiv preprint arXiv:2212.09746,

Mina Lee, Megha Srivastava, Amelia Hardy, John Thickstun, Esin Durmus, Ashwin Paranjape, Ines Gerard-Ursin, Xiang Lisa Li, Faisal Ladhak, Frieda Rong, et al. Evaluating human-language model interaction.arXiv preprint arXiv:2212.09746,

work page arXiv
[32]

Intellagent: A multi-agent framework for evaluating conversational ai systems.arXiv preprint arXiv:2501.11067,

Elad Levi and Ilan Kadar. Intellagent: A multi-agent framework for evaluating conversational ai systems.arXiv preprint arXiv:2501.11067,

work page arXiv
[33]

Deal or No Deal? End-to-End Learning for Negotiation Dialogues

Mike Lewis, Denis Yarats, Yann Dauphin, Devi Parikh, and Dhruv Batra. Deal or no deal? end-to-end learning of negotiation dialogues.ArXiv, abs/1706.05125, 2017.https://api.semanticscholar.org/CorpusID:2454882. Dawei Li, Yuguang Yao, Zhen Tan, Huan Liu, and Ruocheng Guo. Toolprmbench: Evaluating and advancing process reward models for tool-using agents.ArX...

work page internal anchor Pith review Pith/arXiv arXiv 2017
[34]

Jiatong Li, Rui Li, and Qi Liu

https://api.semanticscholar.org/CorpusID: 284910432. Jiatong Li, Rui Li, and Qi Liu. Beyond static datasets: A deep interaction approach to llm evaluation, 2023a. https://arxiv.org/abs/2309.04369. Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. Api-bank: A comprehensive benchmark for tool-augm...

work page arXiv 2023
[35]

Manning, Christopher R’e, Diana Acosta-Navas, Drew A

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher R’e, Diana Acosta-Navas, Drew A. Hudson, E. Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu ...

work page 2023
[36]

Agentbench: Evaluating llms as agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. InInternational Conference on Learning Representations, volume 2024, pages 52989–53046,

work page 2024
[37]

Toolsandbox: A stateful, conversational, interactive evaluation benchmark for llm tool use capabilities

Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Haoping Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, et al. Toolsandbox: A stateful, conversational, interactive evaluation benchmark for llm tool use capabilities. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 1160–1183,

work page 2025
[38]

Evaluating Very Long-Term Conversational Memory of LLM Agents

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of llm agents, 2024.https://arxiv.org/abs/2402.17753. David Manheim and Scott Garrabrant. Categorizing variants of goodhart’s law.ArXiv, abs/1803.04585,

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

https://api.semanticscholar.org/CorpusID:4715794. R. Thomas McCoy, Ellie Pavlick, and Tal Linzen. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference.ArXiv, abs/1902.01007, 2019.https://api.semanticscholar.org/CorpusID:59599752. Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich,...

work page internal anchor Pith review Pith/arXiv arXiv 1902
[40]

Stochasticity in agentic evaluations: Quantifying inconsistency with intraclass correlation.arXiv preprint arXiv:2512.06710, 2025

Zairah Mustahsan, Abel Lim, Megna Anand, Saahil Jain, and Bryan McCann. Stochasticity in agentic evaluations: Quantifying inconsistency with intraclass correlation.arXiv preprint arXiv:2512.06710,

work page arXiv
[41]

Efficient benchmarking of ai agents.arXiv preprint arXiv:2603.23749,

Franck Ndzomga. Efficient benchmarking of ai agents.arXiv preprint arXiv:2603.23749,

work page arXiv
[42]

Chatterji, Faisal Ladhak, and Tatsunori Hashimoto

Yonatan Oren, Nicole Meister, Niladri S. Chatterji, Faisal Ladhak, and Tatsunori Hashimoto. Proving test set contamination in black box language models.ArXiv, abs/2310.17623,

work page arXiv
[43]

Ethan Perez, Sam Ringer, Kamil˙ e Lukoßi¯ ut˙ e, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Benjamin Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Chris Olah, Daisong Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, G R Khundadze, John K...

work page arXiv 2022
[44]

Varbench: Robust language model benchmarking through dynamic variable perturbation.ArXiv, abs/2406.17681, 2024b

15 Kun Qian, Shu Wan, Claudia Tang, Youzhi Wang, Xuanming Zhang, Maximillian Chen, and Zhou Yu. Varbench: Robust language model benchmarking through dynamic variable perturbation.ArXiv, abs/2406.17681, 2024b. https://api.semanticscholar.org/CorpusID:270711329. Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru ...

work page arXiv
[45]

Squad: 100,000+ questions for machine comprehension of text

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. InProceedings of the 2016 conference on empirical methods in natural language processing, pages 2383–2392,

work page 2016
[46]

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents.arXiv preprint arXiv:2405.14573,

work page internal anchor Pith review Pith/arXiv arXiv
[47]

Simworld: An open- ended realistic simulator for autonomous agents in physical and social worlds, 2025

Jiawei Ren, Yan Zhuang, Xiaokang Ye, Lingjun Mao, Xuhong He, Jianzhi Shen, Mrinaal Dogra, Yiming Liang, Ruixuan Zhang, Tianai Yue, et al. Simworld: An open-ended realistic simulator for autonomous agents in physical and social worlds.arXiv preprint arXiv:2512.01078,

work page arXiv
[48]

Beyond accuracy: Behavioral testing of nlp models with checklist.ArXiv, abs/2005.04118, 2020.https://api.semanticscholar.org/CorpusID:218551201

Marco Tulio Ribeiro, Tongshuang Sherry Wu, Carlos Guestrin, and Sameer Singh. Beyond accuracy: Behavioral testing of nlp models with checklist.ArXiv, abs/2005.04118, 2020.https://api.semanticscholar.org/CorpusID:218551201. Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris Maddison, and Tatsunori Hashimoto. ...

work page arXiv 2005
[49]

Nlp evaluation in trouble: On the need to measure llm data contamination for each benchmark

Oscar Sainz, Jon Ander Campos, Iker García-Ferrero, Julen Etxaniz, Oier López de Lacalle, and Eneko Agirre. Nlp evaluation in trouble: On the need to measure llm data contamination for each benchmark. InConference on Empirical Methods in Natural Language Processing, 2023.https://api.semanticscholar.org/CorpusID:264555419. Timo Schick, Jane Dwivedi-Yu, Rob...

work page 2023
[50]

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, and Michael Moor. Agentclinic: a multimodal agent benchmark to evaluate ai in simulated clinical environments.arXiv preprint arXiv:2405.07960,

work page internal anchor Pith review arXiv
[51]

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning.arXiv preprint arXiv:2010.03768,

work page internal anchor Pith review Pith/arXiv arXiv 2010
[52]

Lean copilot: Large language models as copilots for theorem proving in lean

Peiyang Song, Kaiyu Yang, and Anima Anandkumar. Lean copilot: Large language models as copilots for theorem proving in lean. arXiv preprint arXiv:2404.12534,
[53]

Large language model reasoning failures

Peiyang Song, Pengrui Han, and Noah Goodman. Large language model reasoning failures. arXiv preprint arXiv:2602.06176,
[54]

Commonsenseqa: A question answering challenge targeting commonsense knowledge

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158,
[55]

Creative and context-aware translation of east asian idioms with gpt-4

Kenan Tang, Peiyang Song, Yao Qin, and Xifeng Yan. Creative and context-aware translation of east asian idioms with gpt-4. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9285–9305,
[56]

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding, 2019. https://arxiv.org/abs/1804.07461. Jiaxuan Wang, Yulan Hu, Wenjin Yang, Zheng Pan, Xin Li, and Lan-Zhe Guo. Aligning agents via planning: A benchmark for trajectory-level reward...
[57]

Battleagentbench: A benchmark for evaluating cooperation and competition capabilities of language models in multi-agent systems

Wei Wang, Dan Zhang, Tao Feng, Boyan Wang, and Jie Tang. Battleagentbench: A benchmark for evaluating cooperation and competition capabilities of language models in multi-agent systems. arXiv preprint arXiv:2408.15971,
[58]

MINT: Evaluating LLMs in multi-turn interaction with tools and language feedback

Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji. Mint: Evaluating llms in multi-turn interaction with tools and language feedback. arXiv preprint arXiv:2309.10691, 2023.
[59]

Openhands: An open platform for ai software developers as generalist agents

Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents. In International Conference on Learning Representations, volume 2025, pages 65882–65919,
[60]

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2504.12516,

[61]

Livebench: A challenging, contamination-limited llm benchmark

Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh-Agrawal, Sandeep Singh Sandha, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. Livebench: A challenging, contamination-limited llm benchmark. In Internationa...
[62]

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Longmemeval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813,
[63]

Travelplanner: A benchmark for real-world planning with language agents

Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, and Yu Su. Travelplanner: A benchmark for real-world planning with language agents. arXiv preprint arXiv:2402.01622, 2024a. Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osw...
[64]

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Adriano Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. ArXiv, abs/2405.15793,
[65]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022a. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in la...
[66]

Automating dataset updates towards reliable and timely evaluation of large language models

Jiahao Ying, Yixin Cao, Yushi Bai, Qianru Sun, Bo Wang, Wei Tang, Zhaojun Ding, Yizhe Yang, Xuanjing Huang, and Shuicheng Yan. Automating dataset updates towards reliable and timely evaluation of large language models. Advances in Neural Information Processing Systems 37, 2024. https://api.semanticscholar.org/CorpusID:267750054. Lance Ying, Ryan Truong, Pr...
[67]

Interactive Benchmarks

Baoqing Yue, Zihan Zhu, Yifan Zhang, Jichen Feng, Hufei Yang, and Mengdi Wang. Interactive benchmarks. arXiv preprint arXiv:2603.04737,
[68]

Mobile-env: Building qualified evaluation benchmarks for llm-gui interaction

Danyang Zhang, Zhennan Shen, Rui Xie, Situo Zhang, Tianbao Xie, Zihan Zhao, Siyuan Chen, Lu Chen, Hongshen Xu, Ruisheng Cao, et al. Mobile-env: Building qualified evaluation benchmarks for llm-gui interaction. arXiv preprint arXiv:2305.08144,
[69]

Yinger Zhang, Shutong Jiang, Renhao Li, Jianhong Tu, Yang Su, Lianghao Deng, Xudong Guo, Chenxu Lv, and Junyang Lin

Kexun Zhang, Yee Choi, Zhenqiao Song, Taiqi He, William Yang Wang, and Lei Li. Hire a linguist!: Learning endangered languages in llms with in-context linguistic descriptions. In Findings of the Association for Computational Linguistics: ACL 2024, pages 15654–15669, 2024a. Yinger Zhang, Shutong Jiang, Renhao Li, Jianhong Tu, Yang Su, Lianghao Deng, Xudong ...
[70]

Agent-SafetyBench: Evaluating the Safety of LLM Agents

Zhexin Zhang, Shiyao Cui, Yida Lu, Jingzhuo Zhou, Junxiao Yang, Hongning Wang, and Minlie Huang. Agent-safetybench: Evaluating the safety of llm agents. arXiv preprint arXiv:2412.14470, 2024b. Qihao Zhao, Yangyu Huang, Tengchao Lv, Lei Cui, Qinzheng Sun, Shaoguang Mao, Xin Zhang, Ying Xin, Qiufeng Yin, Scarlett Li, and Furu Wei. Mmlu-cf: A contamination-f...
[71]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023a. Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, ...
[72]

Needle in the repo: A benchmark for maintainability in ai-generated repository edits

Haichao Zhu, Qian Zhang, Jiyuan Wang, Zhaorui Yang, and Yuxin Qiu. Needle in the repo: A benchmark for maintainability in ai-generated repository edits, 2026a. https://arxiv.org/abs/2603.27745. Kaijie Zhu, Jiaao Chen, Jindong Wang, Neil Zhenqiang Gong, Diyi Yang, and Xing Xie. Dyval: Dynamic evaluation of large language models for reasoning tasks. In Int...
[73]

Where llm agents fail and how they can learn from failures. arXiv preprint arXiv:2509.25370, 2025

https://api.semanticscholar.org/CorpusID:263310319. Kunlun Zhu, Hongyi Du, Zhaochen Hong, Xiaocheng Yang, Shuyi Guo, Daisy Zhe Wang, Zhenhailong Wang, Cheng Qian, Robert Tang, Heng Ji, et al. Multiagentbench: Evaluating the collaboration and competition of llm agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguisti...
[74]


We then deduplicate papers across channels using arXiv IDs when available and normalized titles otherwise, and apply a shared quality filter to obtain the final candidate set. A paper is retained if it appears in a top venue, or has citation velocity at least 1.5, or has at least 50 GitHub stars. We define citation velocity as CitationVelocity(p) = Citati...
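The deduplication and retention rule described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the `Paper` record, its field names, and the title normalization are hypothetical, and citation velocity is taken as a precomputed input because its definition is cut off in this excerpt. Only the thresholds (1.5 and 50) and the "arXiv ID when available, normalized title otherwise" rule come from the text.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Paper:
    # Hypothetical record layout; field names are assumptions for illustration.
    title: str
    arxiv_id: Optional[str] = None
    top_venue: bool = False
    citation_velocity: float = 0.0  # precomputed elsewhere; formula not reproduced here
    github_stars: int = 0


def normalize_title(title: str) -> str:
    """Lowercase and collapse every non-alphanumeric run to one space (assumed scheme)."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def dedup_key(paper: Paper) -> Tuple[str, str]:
    """Use the arXiv ID when available, the normalized title otherwise."""
    if paper.arxiv_id:
        return ("arxiv", paper.arxiv_id)
    return ("title", normalize_title(paper.title))


def deduplicate(papers: Iterable[Paper]) -> List[Paper]:
    """Keep the first paper seen for each key, preserving input order."""
    seen = {}
    for paper in papers:
        seen.setdefault(dedup_key(paper), paper)
    return list(seen.values())


def passes_quality_filter(paper: Paper, min_velocity: float = 1.5, min_stars: int = 50) -> bool:
    """Retain a paper if any one of the three stated criteria holds."""
    return (
        paper.top_venue
        or paper.citation_velocity >= min_velocity
        or paper.github_stars >= min_stars
    )


def candidate_set(papers: Iterable[Paper]) -> List[Paper]:
    """Deduplicate across channels, then apply the shared quality filter."""
    return [p for p in deduplicate(papers) if passes_quality_filter(p)]


if __name__ == "__main__":
    pool = [
        Paper("WebArena", arxiv_id="2307.13854", github_stars=60),
        Paper("Webarena (channel B copy)", arxiv_id="2307.13854"),
        Paper("Some Unfiltered Paper"),
    ]
    print([p.title for p in candidate_set(pool)])
```

One limitation of this sketch is that a paper listed with an arXiv ID in one channel and without one in another gets two different keys and is not merged; a fuller pipeline would need a fallback match on titles.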

[75]

2016 Reading Comprehension 11,679 — GLUE (Wang et al.,

[76]

2018 Reading Comprehension 10,589 — DROP (Dua et al.,

[77]

2019 Reading Comprehension 1,438 — CommonsenseQA (Talmor et al.,

[78]

2019 Commonsense Reasoning 2,666 168 MMLU (Hendrycks et al.,

[79]

2020 Knowledge & Multitask Reasoning 7,833 1.4k GSM8k (Cobbe et al.,

[80]

2021 Math Reasoning 8,894 1.4k MATH (Hendrycks et al., 2021b) 2021 Math Reasoning 397 433 MiniF2F (Zheng et al.,



[48] [48]

Beyond accuracy: Behavioral testing of nlp models with checklist.ArXiv, abs/2005.04118, 2020.https://api.semanticscholar.org/CorpusID:218551201

Marco Tulio Ribeiro, Tongshuang Sherry Wu, Carlos Guestrin, and Sameer Singh. Beyond accuracy: Behavioral testing of nlp models with checklist.ArXiv, abs/2005.04118, 2020.https://api.semanticscholar.org/CorpusID:218551201. Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris Maddison, and Tatsunori Hashimoto. ...

work page arXiv 2005

[49] [49]

Nlp evaluation in trouble: On the need to measure llm data contamination for each benchmark

Oscar Sainz, Jon Ander Campos, Iker García-Ferrero, Julen Etxaniz, Oier López de Lacalle, and Eneko Agirre. Nlp evaluation in trouble: On the need to measure llm data contamination for each benchmark. InConference on Empirical Methods in Natural Language Processing, 2023.https://api.semanticscholar.org/CorpusID:264555419. Timo Schick, Jane Dwivedi-Yu, Rob...

work page 2023

[50] [50]

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, and Michael Moor. Agentclinic: a multimodal agent benchmark to evaluate ai in simulated clinical environments.arXiv preprint arXiv:2405.07960,

work page internal anchor Pith review arXiv

[51] [51]

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning.arXiv preprint arXiv:2010.03768,

work page internal anchor Pith review Pith/arXiv arXiv 2010

[52] [52]

Lean copilot: Large language models as copilots for theorem proving in lean.arXiv preprint arXiv:2404.12534,

Peiyang Song, Kaiyu Yang, and Anima Anandkumar. Lean copilot: Large language models as copilots for theorem proving in lean.arXiv preprint arXiv:2404.12534,

work page arXiv

[53] [53]

Large language model reasoning failures.arXiv preprint arXiv:2602.06176,

Peiyang Song, Pengrui Han, and Noah Goodman. Large language model reasoning failures.arXiv preprint arXiv:2602.06176,

work page arXiv

[54] [54]

Commonsenseqa: A question answering challenge targeting commonsense knowledge

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158,

work page 2019

[55] [55]

Creative and context-aware translation of east asian idioms with gpt-4

16 Kenan Tang, Peiyang Song, Yao Qin, and Xifeng Yan. Creative and context-aware translation of east asian idioms with gpt-4. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 9285–9305,

work page 2024

[56]

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding, 2019. https://arxiv.org/abs/1804.07461.

Jiaxuan Wang, Yulan Hu, Wenjin Yang, Zheng Pan, Xin Li, and Lan-Zhe Guo. Aligning agents via planning: A benchmark for trajectory-level reward...

[57]

BattleAgentBench: A benchmark for evaluating cooperation and competition capabilities of language models in multi-agent systems

Wei Wang, Dan Zhang, Tao Feng, Boyan Wang, and Jie Tang. BattleAgentBench: A benchmark for evaluating cooperation and competition capabilities of language models in multi-agent systems. arXiv preprint arXiv:2408.15971.

[58]

MINT: Evaluating LLMs in multi-turn interaction with tools and language feedback

Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji. MINT: Evaluating LLMs in multi-turn interaction with tools and language feedback. arXiv preprint arXiv:2309.10691, 2023.

[59]

OpenHands: An open platform for AI software developers as generalist agents

Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. OpenHands: An open platform for AI software developers as generalist agents. In International Conference on Learning Representations, volume 2025, pages 65882–65919.

[60]

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. BrowseComp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2504.12516.

[61]

LiveBench: A challenging, contamination-limited LLM benchmark

Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh-Agrawal, Sandeep Singh Sandha, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. LiveBench: A challenging, contamination-limited LLM benchmark. In Internationa...

[62]

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. LongMemEval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813.

[63]

TravelPlanner: A benchmark for real-world planning with language agents

Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, and Yu Su. TravelPlanner: A benchmark for real-world planning with language agents. arXiv preprint arXiv:2402.01622, 2024a.

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osw...

[64]

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Adriano Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. ArXiv, abs/2405.15793.

[65]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. WebShop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022a.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in la...

[66]

Automating dataset updates towards reliable and timely evaluation of large language models

Jiahao Ying, Yixin Cao, Yushi Bai, Qianru Sun, Bo Wang, Wei Tang, Zhaojun Ding, Yizhe Yang, Xuanjing Huang, and Shuicheng Yan. Automating dataset updates towards reliable and timely evaluation of large language models. Advances in Neural Information Processing Systems 37, 2024. https://api.semanticscholar.org/CorpusID:267750054.

Lance Ying, Ryan Truong, Pr...

[67]

Interactive Benchmarks

Baoqing Yue, Zihan Zhu, Yifan Zhang, Jichen Feng, Hufei Yang, and Mengdi Wang. Interactive benchmarks. arXiv preprint arXiv:2603.04737.

[68]

Mobile-Env: Building qualified evaluation benchmarks for LLM-GUI interaction

Danyang Zhang, Zhennan Shen, Rui Xie, Situo Zhang, Tianbao Xie, Zihan Zhao, Siyuan Chen, Lu Chen, Hongshen Xu, Ruisheng Cao, et al. Mobile-Env: Building qualified evaluation benchmarks for LLM-GUI interaction. arXiv preprint arXiv:2305.08144.

[69]

Yinger Zhang, Shutong Jiang, Renhao Li, Jianhong Tu, Yang Su, Lianghao Deng, Xudong Guo, Chenxu Lv, and Junyang Lin

Kexun Zhang, Yee Choi, Zhenqiao Song, Taiqi He, William Yang Wang, and Lei Li. Hire a linguist!: Learning endangered languages in LLMs with in-context linguistic descriptions. In Findings of the Association for Computational Linguistics: ACL 2024, pages 15654–15669, 2024a.

Yinger Zhang, Shutong Jiang, Renhao Li, Jianhong Tu, Yang Su, Lianghao Deng, Xudong ...

[70]

Agent-SafetyBench: Evaluating the Safety of LLM Agents

Zhexin Zhang, Shiyao Cui, Yida Lu, Jingzhuo Zhou, Junxiao Yang, Hongning Wang, and Minlie Huang. Agent-SafetyBench: Evaluating the safety of LLM agents. arXiv preprint arXiv:2412.14470, 2024b.

Qihao Zhao, Yangyu Huang, Tengchao Lv, Lei Cui, Qinzheng Sun, Shaoguang Mao, Xin Zhang, Ying Xin, Qiufeng Yin, Scarlett Li, and Furu Wei. MMLU-CF: A contamination-f...

[71]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023a.

Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, ...

[72]

Needle in the repo: A benchmark for maintainability in AI-generated repository edits

Haichao Zhu, Qian Zhang, Jiyuan Wang, Zhaorui Yang, and Yuxin Qiu. Needle in the repo: A benchmark for maintainability in AI-generated repository edits, 2026a. https://arxiv.org/abs/2603.27745.

Kaijie Zhu, Jiaao Chen, Jindong Wang, Neil Zhenqiang Gong, Diyi Yang, and Xing Xie. DyVal: Dynamic evaluation of large language models for reasoning tasks. In Int...

[73]

Where LLM agents fail and how they can learn from failures. arXiv preprint arXiv:2509.25370, 2025

https://api.semanticscholar.org/CorpusID:263310319.

Kunlun Zhu, Hongyi Du, Zhaochen Hong, Xiaocheng Yang, Shuyi Guo, Daisy Zhe Wang, Zhenhailong Wang, Cheng Qian, Robert Tang, Heng Ji, et al. MultiAgentBench: Evaluating the collaboration and competition of LLM agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguisti...

[74]

We then deduplicate papers across channels using arXiv IDs when available and normalized titles otherwise, and apply a shared quality filter to obtain the final candidate set. A paper is retained if it appears in a top venue, or has citation velocity at least 1.5, or has at least 50 GitHub stars. We define citation velocity as CitationVelocity(p) = Citati...
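The deduplication and quality filter described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `Paper` record, the `normalize_title` rule, and the choice to keep the first occurrence of a duplicate are all assumptions made here. Citation velocity is taken as a precomputed input, because its definition is cut off in this extract.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional

# Thresholds stated in the text.
MIN_CITATION_VELOCITY = 1.5
MIN_GITHUB_STARS = 50


@dataclass
class Paper:
    """Hypothetical record for one collected paper."""
    title: str
    arxiv_id: Optional[str] = None
    in_top_venue: bool = False
    citation_velocity: float = 0.0  # assumed to be computed elsewhere
    github_stars: int = 0


def normalize_title(title: str) -> str:
    """Assumed normalization: lowercase, drop punctuation, collapse spaces."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())


def dedup_key(paper: Paper) -> str:
    """Use the arXiv ID when available, the normalized title otherwise."""
    if paper.arxiv_id:
        return f"arxiv:{paper.arxiv_id}"
    return f"title:{normalize_title(paper.title)}"


def deduplicate(papers: List[Paper]) -> List[Paper]:
    """Keep the first paper seen for each key (an assumption of this sketch)."""
    seen: Dict[str, Paper] = {}
    for paper in papers:
        seen.setdefault(dedup_key(paper), paper)
    return list(seen.values())


def is_retained(paper: Paper) -> bool:
    """A paper passes if any one of the three stated criteria holds."""
    return (
        paper.in_top_venue
        or paper.citation_velocity >= MIN_CITATION_VELOCITY
        or paper.github_stars >= MIN_GITHUB_STARS
    )


def candidate_set(papers: List[Paper]) -> List[Paper]:
    """Deduplicate across channels, then apply the shared quality filter."""
    return [p for p in deduplicate(papers) if is_retained(p)]
```

Note that under this keying a paper collected once with an arXiv ID and once without one would not be merged; whether the original procedure handles that case is not stated in the text.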


2016 Reading Comprehension 11,679 —
GLUE (Wang et al.) 2018 Reading Comprehension 10,589 —
DROP (Dua et al.) 2019 Reading Comprehension 1,438 —
CommonsenseQA (Talmor et al.) 2019 Commonsense Reasoning 2,666 168
MMLU (Hendrycks et al.) 2020 Knowledge & Multitask Reasoning 7,833 1.4k
GSM8k (Cobbe et al.) 2021 Math Reasoning 8,894 1.4k
MATH (Hendrycks et al., 2021b) 2021 Math Reasoning 397 433
MiniF2F (Zheng et al.,