Interactive Evaluation Requires a Design Science
Pith reviewed 2026-05-20 10:52 UTC · model grok-4.3
The pith
AI evaluation for interactive agents requires a new design science that maps trajectories to judgments of process and system performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Evaluation is an autonomous mapping from evidence to judgments. Interactive evaluation changes both sides of this mapping: the evidence becomes interaction-generated trajectories, while the evaluation procedure must assess process, recoverability, coordination, robustness, and system-level performance. Building on this definition, a two-axis taxonomy organizes the space, design principles and reporting standards are derived, representative scenarios are examined, and longstanding evaluation challenges are shown to reappear at the trajectory level.
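The mapping described above can be sketched in code. This is a minimal illustration, not the paper's formalism: the names `Step`, `Judgment`, `judge_response`, and `judge_trajectory`, and the `ok` flag, are all assumptions introduced here. It contrasts a response-centered evaluator, whose evidence is one output, with a trajectory-level evaluator, whose evidence is the whole interaction.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    """One interaction step (hypothetical schema)."""
    action: str
    observation: str
    ok: bool  # assumed flag: did this step succeed?


# Interactive evidence is a sequence of steps rather than a single response.
Trajectory = List[Step]


@dataclass(frozen=True)
class Judgment:
    outcome_success: bool
    failed_steps: int


def judge_response(response: str, expected: str) -> Judgment:
    """Response-centered mapping: the evidence is one isolated output."""
    return Judgment(outcome_success=(response.strip() == expected), failed_steps=0)


def judge_trajectory(trajectory: Trajectory, goal_observation: str) -> Judgment:
    """Trajectory-level mapping: judge the outcome and the process that led to it."""
    reached = bool(trajectory) and trajectory[-1].observation == goal_observation
    failed = sum(1 for step in trajectory if not step.ok)
    return Judgment(outcome_success=reached, failed_steps=failed)


demo = [
    Step("open_page", "login form", True),
    Step("submit_form", "error: missing field", False),
    Step("submit_form", "order confirmed", True),
]
result = judge_trajectory(demo, "order confirmed")
```

Here the trajectory-level judgment records both that the goal was reached and that one step failed along the way, information an outcome-only judgment would discard.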
What carries the argument
The redefinition of evaluation as an autonomous mapping from evidence to judgments, which structures the two-axis taxonomy for classifying interactive benchmarks and deriving consistent procedures.
Load-bearing premise
Simply adopting previous evaluation paradigms does not suffice for interactive settings, so a new principled design science is required to organize the fragmented landscape of benchmarks.
What would settle it
A demonstration that existing single-response evaluation methods can fully capture, score, and compare interactive agent systems without new principles for trajectories or process assessment would falsify the central claim.
Original abstract
AI evaluation is undergoing a structural change. Large language models (LLMs) are increasingly deployed as systems that act over time through tools, environments, users, and other agents, while many evaluation practices still inherit assumptions from response-centered benchmarks (e.g., fixed inputs, isolated outputs, and outcome judgments that can be made from a single response). The field has begun to build interactive benchmarks, but the resulting landscape is fragmented: benchmarks differ in what interaction artifacts they admit, how trajectories are scored, and what claims their results support. This position paper argues that interactive evaluation should be treated as a principled evaluation paradigm, not merely a new family of agent benchmarks. Simply adopting previous evaluation paradigms does not suffice. We define evaluation as an autonomous mapping from evidence to judgments, and show that interactive evaluation changes both sides of this mapping: the evidence becomes interaction-generated trajectories, while the evaluation procedure must assess process, recoverability, coordination, robustness, and system-level performance. Building on this definition, we propose a two-axis taxonomy, derive design principles and reporting standards, examine representative scenarios, and analyze how longstanding evaluation challenges reappear at the trajectory level.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This position paper claims that interactive evaluation of AI systems (e.g., LLMs acting over time via tools and environments) constitutes a distinct paradigm requiring a principled design science, rather than adaptations of response-centered benchmarks. It defines evaluation as an autonomous mapping from evidence to judgments, shows that interactive settings replace fixed inputs with trajectories and require assessing process, recoverability, coordination, robustness, and system-level performance, and proposes a two-axis taxonomy, design principles, reporting standards, representative scenarios, and re-examination of longstanding challenges at the trajectory level.
Significance. If the argument holds, the paper provides a coherent conceptual reorganization of a fragmented area, with the definitional grounding and logical derivation of changed evidence/judgment requirements as clear strengths. It could guide more comparable and principled benchmark development for agentic systems. The contribution is primarily organizational rather than empirical; its influence would increase if the taxonomy demonstrably clarifies existing work.
major comments (1)
- The section proposing the two-axis taxonomy: the claim that this taxonomy (and associated design principles) addresses fragmentation is central, yet the manuscript provides no concrete application re-classifying or re-analyzing even two or three existing interactive benchmarks to illustrate how the axes clarify differences in admitted artifacts, scoring, or supported claims.
minor comments (1)
- The dimensions of evaluation (process, recoverability, coordination, robustness, system-level performance) are listed without explicit definitions or short illustrative examples in the main text, which reduces clarity for readers new to trajectory-based assessment.
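As an illustration of the kind of short example this comment asks for, one of the listed dimensions, recoverability, could be operationalized as below. This definition is an assumption made here for illustration and is not taken from the paper: `recovery_rate` is a hypothetical helper that treats a failed step as recovered if any later step succeeds.

```python
from typing import List


def recovery_rate(step_ok: List[bool]) -> float:
    """Fraction of failed steps that are followed by at least one later success.

    Returns 1.0 when the trajectory contains no failures.
    """
    failures = [i for i, ok in enumerate(step_ok) if not ok]
    if not failures:
        return 1.0
    recovered = sum(1 for i in failures if any(step_ok[i + 1:]))
    return recovered / len(failures)


# Per-step success flags for two hypothetical trajectories.
recovers = recovery_rate([True, False, True])    # the failure is followed by a success
gives_up = recovery_rate([True, False])          # the failure is never recovered
```

A measure of this kind depends on the order of steps, so it cannot be computed from a final response alone.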
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and the constructive suggestion regarding the taxonomy section. We agree that concrete illustrations will strengthen the paper's ability to demonstrate how the proposed framework organizes the field.
Point-by-point responses
- Referee: The section proposing the two-axis taxonomy: the claim that this taxonomy (and associated design principles) addresses fragmentation is central, yet the manuscript provides no concrete application re-classifying or re-analyzing even two or three existing interactive benchmarks to illustrate how the axes clarify differences in admitted artifacts, scoring, or supported claims.
  Authors: We accept this observation. The taxonomy is intended to provide a principled lens for comparing interactive evaluations, but without explicit re-applications the claim remains abstract. In the revised version we add a dedicated subsection that applies both axes to re-classify three existing benchmarks (WebArena, ToolBench, and a representative multi-agent coordination environment). For each benchmark we explicitly map the admitted interaction artifacts, the scoring procedures used, and the system-level claims that can be supported, thereby showing how the taxonomy surfaces previously unarticulated differences and reduces fragmentation.
  Revision: yes
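The re-classification promised in this response could be recorded in a structure like the following. This is a hedged sketch only: `BenchmarkProfile`, `group_by_scoring`, the placeholder benchmarks, and every field value are hypothetical, and none of them describes WebArena, ToolBench, or any other real benchmark. Assigning admitted artifacts to Axis 1 (Evaluation Inputs) and scoring procedures to Axis 2 (Evaluation Programs) is an assumption about how the axes apply.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class BenchmarkProfile:
    """One benchmark described along the two axes (hypothetical schema)."""
    name: str
    admitted_artifacts: Tuple[str, ...]  # assumed to correspond to Axis 1
    scoring_procedure: str               # assumed to correspond to Axis 2
    supported_claims: Tuple[str, ...]


def group_by_scoring(profiles: Iterable[BenchmarkProfile]) -> Dict[str, List[str]]:
    """Group benchmark names by scoring procedure so differences become visible."""
    groups: Dict[str, List[str]] = {}
    for profile in profiles:
        groups.setdefault(profile.scoring_procedure, []).append(profile.name)
    return groups


# Placeholder entries; the values are invented for illustration.
profiles = [
    BenchmarkProfile("benchmark_a", ("final_state",), "outcome_check", ("task success",)),
    BenchmarkProfile("benchmark_b", ("tool_calls", "final_state"), "step_level_rubric", ("process quality",)),
    BenchmarkProfile("benchmark_c", ("final_state", "messages"), "outcome_check", ("task success",)),
]
groups = group_by_scoring(profiles)
```

Grouping on a single field is the simplest possible comparison; the same records could be grouped by admitted artifacts or by supported claims.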
Circularity Check
No significant circularity in definitional position paper
Full rationale
The paper's argument begins with an explicit definition of evaluation as an autonomous mapping from evidence to judgments and logically extends this to interactive settings by observing that evidence shifts to trajectories and that assessment must incorporate process, recoverability, coordination, robustness, and system-level performance. This extension is presented as a direct consequence of the chosen definition rather than a reduction to fitted inputs, self-citations, or prior results. The subsequent proposal of a two-axis taxonomy, design principles, and reporting standards follows as an organizational framework for addressing fragmentation, without any load-bearing step that equates a claimed derivation to its own inputs by construction. No equations, parameter fits, or uniqueness theorems are invoked in a manner that creates circularity; the contribution remains self-contained as a call for principled structure.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Evaluation is an autonomous mapping from evidence to judgments.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "We define evaluation as an autonomous mapping from evidence to judgments... interactive evaluation changes both sides of this mapping: the evidence becomes interaction-generated trajectories, while the evaluation procedure must assess process, recoverability, coordination, robustness..."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "two-axis taxonomy... Axis 1: Evaluation Inputs... Axis 2: Evaluation Programs"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.