pith. machine review for the scientific record.

arxiv: 2604.06734 · v3 · submitted 2026-04-08 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

TEC: A Collection of Human Trial-and-error Trajectories for Problem Solving

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:48 UTC · model grok-4.3

classification 💻 cs.CL
keywords trial-and-error · dataset · human trajectories · LLM comparison · problem solving · AI training data · error reflections

The pith

A dataset of human trial-and-error trajectories shows people solve problems more effectively than large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors built a platform that captures full records of how people tackle tasks through repeated attempts, logging every step and collecting a written reflection after each error. Data from 46 participants on 58 tasks produced 5,370 trajectories plus reflections across 41,229 webpages. Direct comparison on the same tasks found humans reaching much higher accuracy than LLMs. The resulting collection supplies concrete examples of effective iterative problem solving that current AI systems lack.
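To make the collection concrete, here is a minimal sketch of what one TEC-style trajectory record might look like. The field names (pages_visited, reflection, and so on) are illustrative assumptions, not the schema of the released dataset.

    from dataclasses import dataclass, field

    @dataclass
    class Trial:
        """One attempt at a task: the steps taken and the submitted answer."""
        pages_visited: list[str]        # URLs opened during this trial (assumed field)
        actions: list[str]              # logged steps, e.g. queries issued, links clicked
        answer: str                     # what the participant submitted
        correct: bool                   # verdict from the platform's checker
        reflection: str | None = None   # free-text reflection written after an error

    @dataclass
    class Trajectory:
        """A participant's full multi-trial record for a single task."""
        participant_id: str
        task_id: str
        trials: list[Trial] = field(default_factory=list)

        def solved(self) -> bool:
            # A trajectory counts as solved if any trial was judged correct.
            return any(t.correct for t in self.trials)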

Core claim

We introduce the Trial-and-Error Collection (TEC) consisting of 5,370 trajectories and reflections from humans solving 58 tasks. The data shows humans achieve substantially higher accuracy than LLMs, which demonstrates that humans are more effective in trial-and-error than LLMs.

What carries the argument

The TEC annotation platform and dataset that records complete multi-trial trajectories together with post-error reflections on web-based problem-solving tasks.

If this is right

  • The trajectories can serve as training data for AI systems to acquire more human-like trial-and-error strategies (a minimal conversion to training examples is sketched after this list).
  • The platform supports collection of additional data on new tasks to expand coverage.
  • Error reflections provide examples that could improve how AI systems respond to and learn from failures.
  • The dataset establishes a benchmark for measuring future improvements in AI trial-and-error performance.
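As a concrete reading of the first point above, the sketch below converts one trajectory into supervised (prompt, target) pairs, reusing the hypothetical Trial and Trajectory structures from the earlier sketch. The prompt wording is invented for illustration; this is not the authors' training recipe.

    def trajectory_to_examples(traj, task_text):
        """Behavior-cloning sketch: each attempt becomes a target conditioned on
        the history so far, and each post-error reflection becomes its own example."""
        examples, history = [], ""
        for trial in traj.trials:
            # Clone the attempt itself, conditioned on the failures seen so far.
            examples.append((f"Task: {task_text}\n{history}Answer:", trial.answer))
            if trial.correct:
                break
            if trial.reflection:
                # Teach the model to produce the human's post-error reflection.
                examples.append((
                    f"Task: {task_text}\nFailed answer: {trial.answer}\n"
                    "What went wrong and what should change?",
                    trial.reflection,
                ))
                history += f"Previous attempt failed. Reflection: {trial.reflection}\n"
        return examples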

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Extracting common patterns from human reflections could yield targeted techniques to enhance LLM prompting for iterative tasks (a minimal loop of this kind is sketched after this list).
  • The data may prove especially useful for training models in domains such as coding or research where repeated testing is central.
  • The observed gap indicates that persistent, self-directed exploration across attempts remains a missing capability in current models.
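One way to cash out the first point above is to feed accumulated reflections back into the prompt across attempts. The sketch below is a Reflexion-style loop in miniature, assuming generic llm and check_answer callables and invented prompt wording; it is not the paper's evaluation code.

    def solve_with_reflection(llm, task, check_answer, max_trials=5):
        """Retry loop that mirrors the human protocol: after each wrong answer,
        the model writes a reflection that is folded into the next attempt.
        `llm` maps a prompt string to a text response; `check_answer` returns
        a (correct, feedback) pair."""
        reflections = []
        for trial in range(max_trials):
            prompt = f"Task: {task}\n"
            if reflections:
                prompt += "Reflections on earlier failed attempts:\n"
                prompt += "\n".join(f"- {r}" for r in reflections)
                prompt += "\nUse these to try a different approach.\n"
            prompt += "Answer:"
            answer = llm(prompt)
            correct, feedback = check_answer(answer)
            if correct:
                return answer, trial + 1
            # Elicit a reflection, as the TEC platform does for humans after errors.
            reflections.append(llm(
                f"Task: {task}\nYour answer '{answer}' was wrong. "
                f"Feedback: {feedback}\nIn one sentence, what will you change next time?"
            ))
        return None, max_trials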

Load-bearing premise

The selected tasks, participant pool, and LLM testing setup produce a fair and generalizable comparison of trial-and-error effectiveness between humans and models.

What would settle it

Evaluating LLMs on the exact same 58 tasks using a comparable multi-attempt setup with error feedback and measuring whether they reach or surpass human accuracy.
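A minimal harness for that experiment could look like the following, reusing solve_with_reflection from the sketch above. The human_solved mapping (task_id to solved flag) and the check_answer_for factory stand in for data the released platform would provide; accuracy is the simple solved-within-budget fraction.

    def accuracy(outcomes):
        """Fraction of tasks solved, given a mapping task_id -> bool."""
        return sum(outcomes.values()) / len(outcomes)

    def matched_comparison(tasks, human_solved, llm, check_answer_for, max_trials=5):
        """Run the LLM on the exact task set under the same attempt budget and
        error feedback as the human protocol, then compare accuracies."""
        llm_solved = {}
        for task_id, task in tasks.items():
            answer, _ = solve_with_reflection(
                llm, task, check_answer_for(task_id), max_trials=max_trials
            )
            llm_solved[task_id] = answer is not None
        return {
            "human_accuracy": accuracy(human_solved),
            "llm_accuracy": accuracy(llm_solved),
            "gap": accuracy(human_solved) - accuracy(llm_solved),
        }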

Figures

Figures reproduced from arXiv: 2604.06734 by Jingtao Zhan, Qingyao Ai, Xinkai Zhang, Yiqun Liu.

Figure 1: Platform architecture. The Chrome extension cap…
Figure 2: Multi-stage replay-based annotation workflow. On…
Figure 3: Platform interfaces for multi-stage replay-based annotation workflow. (a) Participants browse freely while the…
Figure 4: Demographic profile of 46 participants. Each participant completed 4 tutorial questions before the formal study, and used an isolated browser profile without prior personal data for privacy. They tried iteratively and could give up after 5 unsuccessful trials. At least one evidence marker per submission was required. Questions appeared in randomized order. Each question received annotations from 42 partic…
Figure 5: Distributions of four key behavioral dimensions.
Figure 6: Conditional probability of corrective plan given er…
Figure 7: Case study: “Who sang Smoke Gets in Your Eyes first?” (answer: Tamara Drasin). Each row shows one method’s…
Figure 8: Query reformulation patterns (GPT-4o-mini). (a)…
Original abstract

Trial-and-error is a fundamental strategy for humans to solve complex problems and a necessary capability for Artificial Intelligence (AI) systems operating in real-world environments. Although several trial-and-error AI techniques have recently been proposed, most of them rely on simple heuristics designed by researchers and achieve limited performance gains. The core issue is the absence of appropriate data: current models cannot learn from detailed records of how humans actually conduct trial-and-error in practice. To address this gap, we introduce a data annotation platform and a corresponding dataset, termed Trial-and-Error Collection (TEC). The platform records users' complete trajectories across multiple trials and collects their reflections after receiving error feedback. Using this platform, we record the problem-solving processes of 46 participants on 58 tasks, resulting in 5,370 trial trajectories along with error reflections across 41,229 webpages. With this dataset, we observe that humans achieve substantially higher accuracy compared to LLMs, which demonstrates that humans are more effective in trial-and-error than LLMs. We believe that the TEC platform and dataset provide a valuable foundation for understanding human trial-and-error behavior and for developing more capable AI systems. Platform and dataset are publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces the Trial-and-Error Collection (TEC) dataset and platform for recording human problem-solving trajectories. Forty-six participants completed 58 tasks, yielding 5,370 trajectories and post-error reflections across 41,229 webpages. The authors report that humans achieve substantially higher accuracy than LLMs on these tasks and release the data publicly to support research on iterative problem solving.

Significance. The dataset provides a large-scale, publicly available record of detailed human trial-and-error processes with error feedback and reflections, which is a genuine contribution given the scarcity of such data. If the collection protocol is sound, the resource could support training or benchmarking of AI systems on realistic iterative strategies. The concrete collection statistics and public release strengthen the work.

major comments (1)
  1. [§4] §4 (LLM Evaluation): The claim that humans are substantially more effective at trial-and-error than LLMs rests on the reported accuracy gap, yet the manuscript does not specify whether LLMs were evaluated with an equivalent multi-turn interface, the same error signals, the same number of attempts, or reflection prompts matching the human platform. If LLMs received only single-pass or limited prompting, the gap is attributable to mismatched conditions rather than a difference in trial-and-error capability.
minor comments (2)
  1. [§3] The task domains and selection criteria for the 58 problems are described only at a high level; adding a table or appendix listing the tasks with brief descriptions would improve reproducibility.
  2. [§3.2] Participant instructions and the exact wording of the reflection prompts are not quoted verbatim; including them would clarify how reflections were elicited.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the single major comment below and will revise the manuscript accordingly to improve clarity.

Point-by-point responses
  1. Referee: [§4] §4 (LLM Evaluation): The claim that humans are substantially more effective at trial-and-error than LLMs rests on the reported accuracy gap, yet the manuscript does not specify whether LLMs were evaluated with an equivalent multi-turn interface, the same error signals, the same number of attempts, or reflection prompts matching the human platform. If LLMs received only single-pass or limited prompting, the gap is attributable to mismatched conditions rather than a difference in trial-and-error capability.

    Authors: We thank the referee for highlighting this important aspect of our evaluation. The LLMs were evaluated using a multi-turn interface that provided the same error signals after each attempt, the same maximum number of attempts per task, and reflection prompts that encouraged analysis of prior errors before the next trial, closely matching the human collection protocol. We acknowledge that the original manuscript did not explicitly describe these matching conditions in sufficient detail. In the revised version, we will expand Section 4 with a precise description of the LLM evaluation setup, including the interface, feedback mechanism, attempt limits, and prompting strategy, to make the comparability transparent.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical data collection with external LLM comparison

full rationale

The paper describes a platform for recording human trial-and-error trajectories on 58 tasks, yielding 5,370 trajectories and reflections. The claim that humans achieve higher accuracy than LLMs is presented as a direct observation from the collected dataset rather than any derivation, fitted parameter, or self-referential prediction. No equations, ansatzes, uniqueness theorems, or model-fitting steps exist in the manuscript. The work is self-contained as a data-release effort whose central empirical finding does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no free parameters, mathematical derivations, or new postulated entities. It rests on the domain assumption that web-based task interactions plus written reflections can faithfully record human trial-and-error behavior.

axioms (1)
  • domain assumption: Web-based task interactions plus written reflections after errors can faithfully record human trial-and-error behavior.
    This assumption underpins the design of the annotation platform and the value of the collected trajectories.

pith-pipeline@v0.9.0 · 5514 in / 1223 out tokens · 44701 ms · 2026-05-10T18:48:10.788630+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 28 canonical work pages · 11 internal anchors

  [1] Nilavra Bhattacharya and Jacek Gwizdka. 2021. YASBIL: Yet Another Search Behaviour (and) Interaction Logger. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (Virtual Event, Canada) (SIGIR ’21). Association for Computing Machinery, New York, NY, USA, 2585–2589. doi:10.1145/3404835.3462800

  [2] Ben Carterette, Paul Clough, Mark Hall, Evangelos Kanoulas, and Mark Sanderson. 2016. Evaluating Retrieval over Sessions: The TREC Session Track 2011-2014. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (Pisa, Italy) (SIGIR ’16). Association for Computing Machinery, New York, NY, USA, 68...

  [3] Ben Carterette, Evangelos Kanoulas, Mark M. Hall, and Paul D. Clough. 2014. Overview of the TREC 2014 Session Track. In Proceedings of The Twenty-Third Text REtrieval Conference, TREC 2014, Gaithersburg, Maryland, USA, November 19-21, 2014 (NIST Special Publication, Vol. 500-308), Ellen M. Voorhees and Angela Ellis (Eds.). National Institute of Standards a...

  [4] Jia Chen, Jiaxin Mao, Yiqun Liu, Fan Zhang, Min Zhang, and Shaoping Ma

  [5] Learning a product relevance model from click-through data in e-commerce,
      Towards a Better Understanding of Query Reformulation Behavior in Web Search. In Proceedings of the Web Conference 2021 (Ljubljana, Slovenia) (WWW ’21). Association for Computing Machinery, New York, NY, USA, 743–755. doi:10.1145/3442381.3450127

  [6] Jia Chen, Jiaxin Mao, Yiqun Liu, Min Zhang, and Shaoping Ma. 2019. TianGong-ST: A New Dataset with Large-scale Refined Real-world Web Search Sessions. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (Beijing, China) (CIKM ’19). Association for Computing Machinery, New York, NY, USA, 2485–2488. doi:10.1145/3357...

  [7] Mingyue Cheng, Jie Ouyang, Shuo Yu, Ruiran Yan, Yucong Luo, Zirui Liu, Daoyu Wang, Qi Liu, and Enhong Chen. 2025. Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning. arXiv:2511.14460 [cs.CL] https://arxiv.org/abs/2511.14460

  [8] C. Darwin. 1859. On the Origin of Species by Means of Natural Selection, Or, The Preservation of Favoured Races in the Struggle for Life. J. Murray. https://books.google.co.jp/books?id=jTZbAAAAQAAJ

  [9] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. MIND2WEB: towards a generalist agent for the web. In Proceedings of the 37th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS ’23). Curran Associates Inc., Red Hook, NY, USA, Article 1220, 24 pages.

  [10] Eugene, serdyukovpv, and Will Cukierski. 2013. Personalized Web Search Challenge. https://kaggle.com/competitions/yandex-personalized-web-search-challenge. Kaggle.

  [11] Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. 2024. CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing. arXiv:2305.11738 [cs.CL] https://arxiv.org/abs/2305.11738

  [12] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

  [13] Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. 2024. WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Assoc...

  [14] Ehsan Kamalloo, Nouha Dziri, Charles L. A. Clarke, and Davood Rafiei. 2023. Evaluating Open-Domain Question Answering in the Era of Large Language Models. arXiv:2305.06984 [cs.CL] https://arxiv.org/abs/2305.06984

  [15] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2021. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401 [cs.CL] https://arxiv.org/abs/2005.11401

  [16] Ruotian Ma, Peisong Wang, Cheng Liu, Xingyan Liu, Jiaqi Chen, Bang Zhang, Xin Zhou, Nan Du, and Jia Li. 2025. S²R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and...

  [17] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. SELF-REFINE: iterative refinement with self-feedback. In Proceedings of the 37th International ...

  [18] David Maxwell and Claudia Hauff. 2021. LogUI: Contemporary Logging Infrastructure for Web-Based Experiments. In Advances in Information Retrieval: 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28 – April 1, 2021, Proceedings, Part II. Springer-Verlag, Berlin, Heidelberg, 525–530. doi:10.1007/978-3-030-72240-1_59

  [19] Yeray Mera, Gabriel Rodriguez, and Eugenia Marin-Garcia. 2021. Unraveling the benefits of experiencing errors during learning: Definition, modulating factors, and explanatory theories. Psychonomic Bulletin & Review 29 (11 2021). doi:10.3758/s13423-021-02022-8

  [20] Janet Metcalfe. 2017. Learning from Errors. Annual Review of Psychology 68, 1 (2017), 465–489. doi:10.1146/annurev-psych-010416-044022

  [21] Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. 2023. GAIA: a benchmark for General AI Assistants. arXiv:2311.12983 [cs.CL] https://arxiv.org/abs/2311.12983

  [22] Matthew Mitsui and Chirag Shah. 2016. Coagmento 2.0: A System for Capturing Individual and Group Information Seeking Behavior. In Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries (Newark, New Jersey, USA) (JCDL ’16). Association for Computing Machinery, New York, NY, USA, 233–234. doi:10.1145/2910896.2925447

  [23] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. 2022. WebGPT: Browser-assisted question-answering with human feedback. arXiv:2112.09...

  [24] A. Newell and H.A. Simon. 2019. Human Problem Solving. Echo Point Books and Media. https://books.google.co.jp/books?id=Gf8EwgEACAAJ

  [25] OpenAI: Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Ko...

  [26] Onat Ozer, Grace Wu, Yuchen Wang, Daniel Dosti, Honghao Zhang, and Vivi De La Rue. 2025. MAR: Multi-Agent Reflexion Improves Reasoning Abilities in LLMs. arXiv:2512.20845 [cs.AI] https://arxiv.org/abs/2512.20845

  [27] Srishti Palani, Zijian Ding, Austin Nguyen, Andrew Chuang, Stephen MacNeil, and Steven P. Dow. 2021. CoNotate: Suggesting Queries Based on Notes Promotes Knowledge Discovery. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (Yokohama, Japan) (CHI ’21). Association for Computing Machinery, New York, NY, USA, Article 726, 14 pages.

  [28] Karl Popper. 1999. All Life is Problem Solving. Routledge, London. Translated by Patrick Camiller.

  [29] Navid Rekabsaz, Oleg Lesota, Markus Schedl, Jon Brassey, and Carsten Eickhoff

  [30] TripClick: The Log Files of a Large Health Web Search Engine. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (Virtual Event, Canada) (SIGIR ’21). Association for Computing Machinery, New York, NY, USA, 2507–2513. doi:10.1145/3404835.3463242

  [31] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: language agents with verbal reinforcement learning. In Proceedings of the 37th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS ’23). Curran Associates Inc., Red Hook, NY, USA, Article 377, 19 pages.

  [32] Herbert A. Simon. 1978. Information-Processing Theory of Human Problem Solving. https://api.semanticscholar.org/CorpusID:10344827

  [33] Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Ha...

  [34] Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese

  [35] BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents. arXiv:2504.12516 [cs.CL] https://arxiv.org/abs/2504.12516

  [36] Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, and Fei Huang. 2025. WebWalker: Benchmarking LLMs in Web Traversal. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, a...

  [37] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  [38] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629 [cs.CL] https://arxiv.org/abs/2210.03629

  [39] Siyu Yuan, Zehui Chen, Zhiheng Xi, Junjie Ye, Zhengyin Du, and Jiecao Chen

  [40] Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training. arXiv:2501.11425 [cs.AI] https://arxiv.org/abs/2501.11425

  [41] Jingtao Zhan, Jiahao Zhao, Jiayu Li, Yiqun Liu, Bo Zhang, Qingyao Ai, Jiaxin Mao, Hongning Wang, Min Zhang, and Shaoping Ma. 2025. Evaluating Intelligence via Trial and Error. arXiv:2502.18858 [cs.AI] https://arxiv.org/abs/2502.18858

  [42] Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou

  [43] Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv:2506.05176 [cs.CL] https://arxiv.org/abs/2506.05176

  [44] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. 2024. WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854 [cs.AI] https://arxiv.org/abs/2307.13854