pith. machine review for the scientific record.

arxiv: 2604.06734 · v3 · submitted 2026-04-08 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

TEC: A Collection of Human Trial-and-error Trajectories for Problem Solving

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:48 UTC · model grok-4.3

classification 💻 cs.CL
keywords trial-and-error · dataset · human trajectories · LLM comparison · problem solving · AI training data · error reflections

The pith

A dataset of human trial-and-error trajectories shows people solve problems more effectively than large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors built a platform that captures full records of how people tackle tasks through repeated attempts, logging every step and collecting a written reflection after each error. Data from 46 participants on 58 tasks produced 5,370 trajectories plus reflections across 41,229 webpages. Direct comparison on the same tasks found humans reaching much higher accuracy than LLMs. The resulting collection supplies concrete examples of effective iterative problem solving that current AI systems lack.
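To make the collection concrete, here is a minimal sketch of what one TEC-style trajectory record might look like. The field names (pages_visited, reflection, and so on) are illustrative assumptions, not the schema of the released dataset.

    from dataclasses import dataclass, field

    @dataclass
    class Trial:
        """One attempt at a task: the steps taken and the submitted answer."""
        pages_visited: list[str]        # URLs opened during this trial (assumed field)
        actions: list[str]              # logged steps, e.g. queries issued, links clicked
        answer: str                     # what the participant submitted
        correct: bool                   # verdict from the platform's checker
        reflection: str | None = None   # free-text reflection written after an error

    @dataclass
    class Trajectory:
        """A participant's full multi-trial record for a single task."""
        participant_id: str
        task_id: str
        trials: list[Trial] = field(default_factory=list)

        def solved(self) -> bool:
            # A trajectory counts as solved if any trial was judged correct.
            return any(t.correct for t in self.trials)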

Core claim

We introduce the Trial-and-Error Collection (TEC) consisting of 5,370 trajectories and reflections from humans solving 58 tasks. The data shows humans achieve substantially higher accuracy than LLMs, which demonstrates that humans are more effective in trial-and-error than LLMs.

What carries the argument

The TEC annotation platform and dataset that records complete multi-trial trajectories together with post-error reflections on web-based problem-solving tasks.

If this is right

  • The trajectories can serve as training data for AI systems to acquire more human-like trial-and-error strategies (a minimal conversion to training examples is sketched after this list).
  • The platform supports collection of additional data on new tasks to expand coverage.
  • Error reflections provide examples that could improve how AI systems respond to and learn from failures.
  • The dataset establishes a benchmark for measuring future improvements in AI trial-and-error performance.
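As a concrete reading of the first point above, the sketch below converts one trajectory into supervised (prompt, target) pairs, reusing the hypothetical Trial and Trajectory structures from the earlier sketch. The prompt wording is invented for illustration; this is not the authors' training recipe.

    def trajectory_to_examples(traj, task_text):
        """Behavior-cloning sketch: each attempt becomes a target conditioned on
        the history so far, and each post-error reflection becomes its own example."""
        examples, history = [], ""
        for trial in traj.trials:
            # Clone the attempt itself, conditioned on the failures seen so far.
            examples.append((f"Task: {task_text}\n{history}Answer:", trial.answer))
            if trial.correct:
                break
            if trial.reflection:
                # Teach the model to produce the human's post-error reflection.
                examples.append((
                    f"Task: {task_text}\nFailed answer: {trial.answer}\n"
                    "What went wrong and what should change?",
                    trial.reflection,
                ))
                history += f"Previous attempt failed. Reflection: {trial.reflection}\n"
        return examples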

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Extracting common patterns from human reflections could yield targeted techniques to enhance LLM prompting for iterative tasks (a minimal loop of this kind is sketched after this list).
  • The data may prove especially useful for training models in domains such as coding or research where repeated testing is central.
  • The observed gap indicates that persistent, self-directed exploration across attempts remains a missing capability in current models.
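One way to cash out the first point above is to feed accumulated reflections back into the prompt across attempts. The sketch below is a Reflexion-style loop in miniature, assuming generic llm and check_answer callables and invented prompt wording; it is not the paper's evaluation code.

    def solve_with_reflection(llm, task, check_answer, max_trials=5):
        """Retry loop that mirrors the human protocol: after each wrong answer,
        the model writes a reflection that is folded into the next attempt.
        `llm` maps a prompt string to a text response; `check_answer` returns
        a (correct, feedback) pair."""
        reflections = []
        for trial in range(max_trials):
            prompt = f"Task: {task}\n"
            if reflections:
                prompt += "Reflections on earlier failed attempts:\n"
                prompt += "\n".join(f"- {r}" for r in reflections)
                prompt += "\nUse these to try a different approach.\n"
            prompt += "Answer:"
            answer = llm(prompt)
            correct, feedback = check_answer(answer)
            if correct:
                return answer, trial + 1
            # Elicit a reflection, as the TEC platform does for humans after errors.
            reflections.append(llm(
                f"Task: {task}\nYour answer '{answer}' was wrong. "
                f"Feedback: {feedback}\nIn one sentence, what will you change next time?"
            ))
        return None, max_trials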

Load-bearing premise

The selected tasks, participant pool, and LLM testing setup produce a fair and generalizable comparison of trial-and-error effectiveness between humans and models.

What would settle it

Evaluating LLMs on the exact same 58 tasks using a comparable multi-attempt setup with error feedback and measuring whether they reach or surpass human accuracy.
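A minimal harness for that experiment could look like the following, reusing solve_with_reflection from the sketch above. The human_solved mapping (task_id to solved flag) and the check_answer_for factory stand in for data the released platform would provide; accuracy is the simple solved-within-budget fraction.

    def accuracy(outcomes):
        """Fraction of tasks solved, given a mapping task_id -> bool."""
        return sum(outcomes.values()) / len(outcomes)

    def matched_comparison(tasks, human_solved, llm, check_answer_for, max_trials=5):
        """Run the LLM on the exact task set under the same attempt budget and
        error feedback as the human protocol, then compare accuracies."""
        llm_solved = {}
        for task_id, task in tasks.items():
            answer, _ = solve_with_reflection(
                llm, task, check_answer_for(task_id), max_trials=max_trials
            )
            llm_solved[task_id] = answer is not None
        return {
            "human_accuracy": accuracy(human_solved),
            "llm_accuracy": accuracy(llm_solved),
            "gap": accuracy(human_solved) - accuracy(llm_solved),
        }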

Figures

Figures reproduced from arXiv: 2604.06734 by Jingtao Zhan, Qingyao Ai, Xinkai Zhang, Yiqun Liu.

Figure 1: Platform architecture. The Chrome extension cap…
Figure 2: Multi-stage replay-based annotation workflow. On…
Figure 3: Platform interfaces for multi-stage replay-based annotation workflow. (a) Participants browse freely while the…
Figure 4: Demographic profile of 46 participants. Each participant completed 4 tutorial questions before the formal study, and used an isolated browser profile without prior personal data for privacy. They tried iteratively and could give up after 5 unsuccessful trials. At least one evidence marker per submission was required. Questions appeared in randomized order. Each question received annotations from 42 partic…
Figure 5: Distributions of four key behavioral dimensions.
Figure 6: Conditional probability of corrective plan given er…
Figure 7: Case study: “Who sang Smoke Gets in Your Eyes first?” (answer: Tamara Drasin). Each row shows one method’s…
Figure 8: Query reformulation patterns (GPT-4o-mini). (a)…
Original abstract

Trial-and-error is a fundamental strategy for humans to solve complex problems and a necessary capability for Artificial Intelligence (AI) systems operating in real-world environments. Although several trial-and-error AI techniques have recently been proposed, most of them rely on simple heuristics designed by researchers and achieve limited performance gains. The core issue is the absence of appropriate data: current models cannot learn from detailed records of how humans actually conduct trial-and-error in practice. To address this gap, we introduce a data annotation platform and a corresponding dataset, termed Trial-and-Error Collection (TEC). The platform records users' complete trajectories across multiple trials and collects their reflections after receiving error feedback. Using this platform, we record the problem-solving processes of 46 participants on 58 tasks, resulting in 5,370 trial trajectories along with error reflections across 41,229 webpages. With this dataset, we observe that humans achieve substantially higher accuracy compared to LLMs, which demonstrates that humans are more effective in trial-and-error than LLMs. We believe that the TEC platform and dataset provide a valuable foundation for understanding human trial-and-error behavior and for developing more capable AI systems. Platform and dataset are publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces the Trial-and-Error Collection (TEC) dataset and platform for recording human problem-solving trajectories. Forty-six participants completed 58 tasks, yielding 5,370 trajectories and post-error reflections across 41,229 webpages. The authors report that humans achieve substantially higher accuracy than LLMs on these tasks and release the data publicly to support research on iterative problem solving.

Significance. The dataset provides a large-scale, publicly available record of detailed human trial-and-error processes with error feedback and reflections, which is a genuine contribution given the scarcity of such data. If the collection protocol is sound, the resource could support training or benchmarking of AI systems on realistic iterative strategies. The concrete collection statistics and public release strengthen the work.

major comments (1)
  1. [§4] §4 (LLM Evaluation): The claim that humans are substantially more effective at trial-and-error than LLMs rests on the reported accuracy gap, yet the manuscript does not specify whether LLMs were evaluated with an equivalent multi-turn interface, the same error signals, the same number of attempts, or reflection prompts matching the human platform. If LLMs received only single-pass or limited prompting, the gap is attributable to mismatched conditions rather than a difference in trial-and-error capability.
minor comments (2)
  1. [§3] The task domains and selection criteria for the 58 problems are described only at a high level; adding a table or appendix listing the tasks with brief descriptions would improve reproducibility.
  2. [§3.2] Participant instructions and the exact wording of the reflection prompts are not quoted verbatim; including them would clarify how reflections were elicited.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the single major comment below and will revise the manuscript accordingly to improve clarity.

Point-by-point responses
  1. Referee: [§4] §4 (LLM Evaluation): The claim that humans are substantially more effective at trial-and-error than LLMs rests on the reported accuracy gap, yet the manuscript does not specify whether LLMs were evaluated with an equivalent multi-turn interface, the same error signals, the same number of attempts, or reflection prompts matching the human platform. If LLMs received only single-pass or limited prompting, the gap is attributable to mismatched conditions rather than a difference in trial-and-error capability.

    Authors: We thank the referee for highlighting this important aspect of our evaluation. The LLMs were evaluated using a multi-turn interface that provided the same error signals after each attempt, the same maximum number of attempts per task, and reflection prompts that encouraged analysis of prior errors before the next trial, closely matching the human collection protocol. We acknowledge that the original manuscript did not explicitly describe these matching conditions in sufficient detail. In the revised version, we will expand Section 4 with a precise description of the LLM evaluation setup, including the interface, feedback mechanism, attempt limits, and prompting strategy, to make the comparability transparent.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical data collection with external LLM comparison

full rationale

The paper describes a platform for recording human trial-and-error trajectories on 58 tasks, yielding 5,370 trajectories and reflections. The claim that humans achieve higher accuracy than LLMs is presented as a direct observation from the collected dataset rather than any derivation, fitted parameter, or self-referential prediction. No equations, ansatzes, uniqueness theorems, or model-fitting steps exist in the manuscript. The work is self-contained as a data-release effort whose central empirical finding does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no free parameters, mathematical derivations, or new postulated entities. It rests on the domain assumption that web-based task interactions plus written reflections can faithfully record human trial-and-error behavior.

axioms (1)
  • domain assumption: Web-based task interactions plus written reflections after errors can faithfully record human trial-and-error behavior.
    This assumption underpins the design of the annotation platform and the value of the collected trajectories.

pith-pipeline@v0.9.0 · 5514 in / 1223 out tokens · 44701 ms · 2026-05-10T18:48:10.788630+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 28 canonical work pages · 11 internal anchors

  [1] Nilavra Bhattacharya and Jacek Gwizdka. 2021. YASBIL: Yet Another Search Behaviour (and) Interaction Logger. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (Virtual Event, Canada) (SIGIR ’21). Association for Computing Machinery, New York, NY, USA, 2585–2589. doi:10.1145/3404835.3462800

  [2] Ben Carterette, Paul Clough, Mark Hall, Evangelos Kanoulas, and Mark Sanderson. 2016. Evaluating Retrieval over Sessions: The TREC Session Track 2011-2014. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (Pisa, Italy) (SIGIR ’16). Association for Computing Machinery, New York, NY, USA, 68...

  [3] Ben Carterette, Evangelos Kanoulas, Mark M. Hall, and Paul D. Clough. 2014. Overview of the TREC 2014 Session Track. In Proceedings of The Twenty-Third Text REtrieval Conference, TREC 2014, Gaithersburg, Maryland, USA, November 19-21, 2014 (NIST Special Publication, Vol. 500-308), Ellen M. Voorhees and Angela Ellis (Eds.). National Institute of Standards a...

  [4] Jia Chen, Jiaxin Mao, Yiqun Liu, Fan Zhang, Min Zhang, and Shaoping Ma

  [5] Learning a product relevance model from click-through data in e-commerce,
      Towards a Better Understanding of Query Reformulation Behavior in Web Search. In Proceedings of the Web Conference 2021 (Ljubljana, Slovenia) (WWW ’21). Association for Computing Machinery, New York, NY, USA, 743–755. doi:10.1145/3442381.3450127

  [6] Jia Chen, Jiaxin Mao, Yiqun Liu, Min Zhang, and Shaoping Ma. 2019. TianGong-ST: A New Dataset with Large-scale Refined Real-world Web Search Sessions. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (Beijing, China) (CIKM ’19). Association for Computing Machinery, New York, NY, USA, 2485–2488. doi:10.1145/3357...

  [7] Mingyue Cheng, Jie Ouyang, Shuo Yu, Ruiran Yan, Yucong Luo, Zirui Liu, Daoyu Wang, Qi Liu, and Enhong Chen. 2025. Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning. arXiv:2511.14460 [cs.CL] https://arxiv.org/abs/2511.14460

  [8] C. Darwin. 1859. On the Origin of Species by Means of Natural Selection, Or, The Preservation of Favoured Races in the Struggle for Life. J. Murray. https://books.google.co.jp/books?id=jTZbAAAAQAAJ

  [9] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. MIND2WEB: towards a generalist agent for the web. In Proceedings of the 37th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS ’23). Curran Associates Inc., Red Hook, NY, USA, Article 1220, 24 pages.

  [10] Eugene, serdyukovpv, and Will Cukierski. 2013. Personalized Web Search Challenge. https://kaggle.com/competitions/yandex-personalized-web-search-challenge. Kaggle.

  [11] Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. 2024. CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing. arXiv:2305.11738 [cs.CL] https://arxiv.org/abs/2305.11738

  [12] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

  [13] Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. 2024. WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Assoc...

  [14] Ehsan Kamalloo, Nouha Dziri, Charles L. A. Clarke, and Davood Rafiei. 2023. Evaluating Open-Domain Question Answering in the Era of Large Language Models. arXiv:2305.06984 [cs.CL] https://arxiv.org/abs/2305.06984

  [15] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2021. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401 [cs.CL] https://arxiv.org/abs/2005.11401

  [16] Ruotian Ma, Peisong Wang, Cheng Liu, Xingyan Liu, Jiaqi Chen, Bang Zhang, Xin Zhou, Nan Du, and Jia Li. 2025. S²R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and...

  [17] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. SELF-REFINE: iterative refinement with self-feedback. In Proceedings of the 37th International ...

  [18] David Maxwell and Claudia Hauff. 2021. LogUI: Contemporary Logging Infrastructure for Web-Based Experiments. In Advances in Information Retrieval: 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28 – April 1, 2021, Proceedings, Part II. Springer-Verlag, Berlin, Heidelberg, 525–530. doi:10.1007/978-3-030-72240-1_59

  [19] Yeray Mera, Gabriel Rodriguez, and Eugenia Marin-Garcia. 2021. Unraveling the benefits of experiencing errors during learning: Definition, modulating factors, and explanatory theories. Psychonomic Bulletin & Review 29 (11 2021). doi:10.3758/s13423-021-02022-8

  [20] Janet Metcalfe. 2017. Learning from Errors. Annual Review of Psychology 68, 1 (2017), 465–489. doi:10.1146/annurev-psych-010416-044022

  [21] Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. 2023. GAIA: a benchmark for General AI Assistants. arXiv:2311.12983 [cs.CL] https://arxiv.org/abs/2311.12983

  [22] Matthew Mitsui and Chirag Shah. 2016. Coagmento 2.0: A System for Capturing Individual and Group Information Seeking Behavior. In Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries (Newark, New Jersey, USA) (JCDL ’16). Association for Computing Machinery, New York, NY, USA, 233–234. doi:10.1145/2910896.2925447

  [23] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. 2022. WebGPT: Browser-assisted question-answering with human feedback. arXiv:2112.09...

  [24] A. Newell and H.A. Simon. 2019. Human Problem Solving. Echo Point Books and Media. https://books.google.co.jp/books?id=Gf8EwgEACAAJ

  [25] OpenAI: Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Ko...

  [26] Onat Ozer, Grace Wu, Yuchen Wang, Daniel Dosti, Honghao Zhang, and Vivi De La Rue. 2025. MAR: Multi-Agent Reflexion Improves Reasoning Abilities in LLMs. arXiv:2512.20845 [cs.AI] https://arxiv.org/abs/2512.20845

  [27] Srishti Palani, Zijian Ding, Austin Nguyen, Andrew Chuang, Stephen MacNeil, and Steven P. Dow. 2021. CoNotate: Suggesting Queries Based on Notes Promotes Knowledge Discovery. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (Yokohama, Japan) (CHI ’21). Association for Computing Machinery, New York, NY, USA, Article 726, 14 pages.

  [28] Karl Popper. 1999. All Life is Problem Solving. Routledge, London. Translated by Patrick Camiller.

  [29] Navid Rekabsaz, Oleg Lesota, Markus Schedl, Jon Brassey, and Carsten Eickhoff

  [30] TripClick: The Log Files of a Large Health Web Search Engine. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (Virtual Event, Canada) (SIGIR ’21). Association for Computing Machinery, New York, NY, USA, 2507–2513. doi:10.1145/3404835.3463242

  [31] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: language agents with verbal reinforcement learning. In Proceedings of the 37th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS ’23). Curran Associates Inc., Red Hook, NY, USA, Article 377, 19 pages.

  [32] Herbert A. Simon. 1978. Information-Processing Theory of Human Problem Solving. https://api.semanticscholar.org/CorpusID:10344827

  [33] Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Ha...

  [34] Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese

  [35] BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents. arXiv:2504.12516 [cs.CL] https://arxiv.org/abs/2504.12516

  [36] Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, and Fei Huang. 2025. WebWalker: Benchmarking LLMs in Web Traversal. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, a...

  [37] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  [38] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629 [cs.CL] https://arxiv.org/abs/2210.03629

  [39] Siyu Yuan, Zehui Chen, Zhiheng Xi, Junjie Ye, Zhengyin Du, and Jiecao Chen

  [40] Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training. arXiv:2501.11425 [cs.AI] https://arxiv.org/abs/2501.11425

  [41] Jingtao Zhan, Jiahao Zhao, Jiayu Li, Yiqun Liu, Bo Zhang, Qingyao Ai, Jiaxin Mao, Hongning Wang, Min Zhang, and Shaoping Ma. 2025. Evaluating Intelligence via Trial and Error. arXiv:2502.18858 [cs.AI] https://arxiv.org/abs/2502.18858

  [42] Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou

  [43] Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv:2506.05176 [cs.CL] https://arxiv.org/abs/2506.05176

  [44] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. 2024. WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854 [cs.AI] https://arxiv.org/abs/2307.13854