Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

Dongqi Huang; Hongbang Yuan; Jiachun Li; Jinshan Gao; Jun Zhao; Kang Liu; Kejian Zhu; Lingshuai Wang; Longxiang Wang; Lu Wang

arxiv: 2606.12191 · v1 · pith:FYDB6HCUnew · submitted 2026-06-10 · 💻 cs.CL · cs.AI


Jiachun Li, Zhuoran Jin, Tianyi Men, Yupu Hao, Kejian Zhu, Lingshuai Wang, Dongqi Huang, Longxiang Wang, Shengjia Hua, Lu Wang, Jinshan Gao, Hongbang Yuan, Ruilin Xu, Kang Liu, Jun Zhao


Pith reviewed 2026-06-27 09:45 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords: agentic environments, LLM agents, environment modeling, environment synthesis, environment evaluation, agent-environment co-evolution, symbolic synthesis, neural synthesis


The pith

Environments for LLM-based agents are engineered through modeling, synthesis, evaluation, and application to support continual capability evolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish a systematic view of agentic environments by organizing existing work around the full lifecycle of environment engineering. It details how environments are modeled using eight attributes in eight domains, synthesized with symbolic or neural methods, evaluated accordingly, and applied to let agents and environments evolve together. This matters because it identifies specific mechanisms like four agent evolution pathways and three environment evolution paradigms that can be used to advance LLM agents in interactive settings. A sympathetic reader would see this as a map for building better interactive systems that push model limits.

Core claim

Environments serve as interactive systems for large language model based agents across diverse scenarios and play a crucial role in driving the continual evolution of model capabilities. The paper systematically studies current research on agentic environments from the perspective of the environment engineering lifecycle, covering their modeling, synthesis, evaluation and application. It introduces representative environments from the perspectives of eight attributes and eight domains, two paradigms for automated environment synthesis, different environment evaluation methods, and from the agent-environment co-evolution perspective, four complementary perspectives for agent evolution and t

What carries the argument

The environment engineering lifecycle, which organizes the study of agentic environments into stages of modeling, synthesis, evaluation, and application, with co-evolution as the application focus.

If this is right

Representative environments are classified using eight attributes and eight domains.
Automated synthesis uses symbolic synthesis and neural synthesis paradigms.
Evaluation methods are tied to each synthesis paradigm.
Agent evolution follows memory-centric, orchestration-centric, trajectory-centric or exploration-centric paths.
Environment evolution is neural-driven, difficulty-driven or scaling-driven.
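
The categories listed above can be written down as a small data structure, which makes it easier to tag a given work and to spot one that fits none of the categories. This is a minimal sketch, not a schema from the survey: the `SurveyedWork` record and its field names are hypothetical, and only the category names come from the text above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SynthesisParadigm(Enum):
    SYMBOLIC = "symbolic"
    NEURAL = "neural"


class AgentEvolution(Enum):
    MEMORY_CENTRIC = "memory-centric"
    ORCHESTRATION_CENTRIC = "orchestration-centric"
    TRAJECTORY_CENTRIC = "trajectory-centric"
    EXPLORATION_CENTRIC = "exploration-centric"


class EnvironmentEvolution(Enum):
    NEURAL_DRIVEN = "neural-driven"
    DIFFICULTY_DRIVEN = "difficulty-driven"
    SCALING_DRIVEN = "scaling-driven"


@dataclass(frozen=True)
class SurveyedWork:
    """Hypothetical record for tagging one work; not defined by the survey."""
    name: str
    synthesis: Optional[SynthesisParadigm] = None
    agent_evolution: Optional[AgentEvolution] = None
    environment_evolution: Optional[EnvironmentEvolution] = None

    def is_classified(self) -> bool:
        # True if the work carries at least one tag from the taxonomy.
        tags = (self.synthesis, self.agent_evolution, self.environment_evolution)
        return any(tag is not None for tag in tags)
```

A work for which `is_classified()` returns `False` after an honest tagging attempt would be a candidate counterexample of the kind described under "What would settle it".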

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Organizing environments this way may reveal opportunities to create hybrid environments that combine symbolic and neural synthesis for more robust testing.
The four agent evolution perspectives could be combined in single systems to test if they produce synergistic improvements.
Future environment design might prioritize scaling-driven approaches if they prove most effective at pushing model limits.
This survey implies that progress in LLM agents depends as much on environment quality as on model architecture.

Load-bearing premise

The representative environments, paradigms, and pathways selected from the literature provide a comprehensive and unbiased view of the field without significant omissions in coverage or categorization.

What would settle it

Identification of an important agentic environment or evolution method that does not align with the proposed eight attributes and domains, two synthesis paradigms, four agent evolution perspectives, or three environment evolution paradigms.

Figures

Figures reproduced from arXiv: 2606.12191 by Dongqi Huang, Hongbang Yuan, Jiachun Li, Jinshan Gao, Jun Zhao, Kang Liu, Kejian Zhu, Lingshuai Wang, Longxiang Wang, Lu Wang, Ruilin Xu, Shengjia Hua, Tianyi Men, Yupu Hao, Zhuoran Jin.

**Figure 2.** Taxonomy of the main content of this survey.

**Figure 3.** Taxonomy of the main content of this survey.

**Figure 4.** A comparison between data engineering and environment engi

**Figure 5.** An overview of environment attributes.

… and how the environment dynamics are generated. This determines whether the system relies on programmed code to update the states or uses a neural model to predict future states.

3.1.1 Symbolic Environment

In a symbolic environment, the transition dynamics are governed by explicit programmed logic and predefined rules. The transition function $P(s_{t+1} \mid s_t, a_t)$ is realiz…
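
To make the symbolic case concrete, the sketch below shows an environment whose transition function is ordinary programmed logic. It is an illustrative toy under stated assumptions: the grid world, the `GridState` type, the `MOVES` action set, and the class name are all hypothetical and do not come from the survey. Because the rule is deterministic, the transition function places probability 1 on a single next state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GridState:
    x: int
    y: int


# Hypothetical action set for a toy grid world.
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


class SymbolicGridEnv:
    """Toy symbolic environment: state updates follow explicit coded rules."""

    def __init__(self, width: int = 3, height: int = 3) -> None:
        self.width = width
        self.height = height
        self.state = GridState(0, 0)

    def transition(self, state: GridState, action: str) -> GridState:
        # Deterministic rule: move one cell, clamped to the grid bounds.
        dx, dy = MOVES[action]
        x = min(max(state.x + dx, 0), self.width - 1)
        y = min(max(state.y + dy, 0), self.height - 1)
        return GridState(x, y)

    def step(self, action: str) -> GridState:
        # Apply the coded transition and store the resulting state.
        self.state = self.transition(self.state, action)
        return self.state
```

A neural environment, by contrast, would replace the body of `transition` with a call to a learned model that predicts the next state; that variant is not sketched here.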

**Figure 6.** An overview of environment domains, including GUI, Deep Research, Embodied, Game, Tool, Code, and Domain-Specific.

**Figure 7.** Three symbolic environment synthesis methods are presented: Task-Driven Synthesis, Real-World-Driven Synthesis, and De Novo

**Figure 8.** Three neural environment synthesis paradigms are presented: Pixel-Level Modeling, Word-Level Modeling, and Latent-Level Modeling,

**Figure 9.** Overview of agent evolution paradigms. Existing methods are organized into four categories: Memory-Centric Experience Evolution, which

**Figure 10.** Overview of environment evolution paradigms. Existing methods are organized into three categories: Neural-Driven Evolution, which evolves environments through self-play or world models; Difficulty-Driven Evolution, which adapts task difficulty via explicit curriculum signals or implicit curriculum mechanisms; and Scaling-Driven Evolution, which expands environment diversity at the scenario or environment …
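
As an illustration of what an explicit curriculum signal could look like in difficulty-driven evolution, the sketch below raises or lowers a task difficulty level from the agent's recent success rate. It is a guess at one simple rule, not a method taken from the survey: the function name, the 0.3 and 0.7 thresholds, and the integer level range are all assumptions.

```python
from typing import Sequence


def adjust_difficulty(
    level: int,
    successes: Sequence[bool],
    low: float = 0.3,
    high: float = 0.7,
    min_level: int = 1,
    max_level: int = 10,
) -> int:
    """Return the next difficulty level given recent episode outcomes.

    Thresholds and level bounds are illustrative assumptions.
    """
    if not successes:
        # No evidence yet, so keep the current level.
        return level
    rate = sum(successes) / len(successes)
    if rate > high:
        # Agent succeeds too often: make tasks harder.
        return min(level + 1, max_level)
    if rate < low:
        # Agent fails too often: make tasks easier.
        return max(level - 1, min_level)
    return level
```

Calling `adjust_difficulty(3, [True, True, True, True])` moves the level up by one, while a mixed record such as `[True, False]` leaves it unchanged.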

Original abstract

Environments serve as interactive systems for large language model (LLM) based agents across diverse scenarios and play a crucial role in driving the continual evolution of model capabilities. Despite this importance, existing work lacks a systematic categorization and deep analysis. This paper systematically studies current researches on agentic environments from the perspective of the environment engineering lifecycle, covering their modeling, synthesis, evaluation and application. Specifically, the paper first introduces representative environments from the perspectives of eight attributes and eight domains, providing detailed analyses of their development paths and highlighting their core capabilities. Second, for automated environment synthesis, two paradigms are introduced, such as symbolic synthesis and neural synthesis. This paper also shows different environment evaluation methods in each paradigm. Thirdly, the corresponding environment applications from the perspective of agent-environment co-evolution are discussed. In specific, the paper characterizes the primary pathways for agent evolution in dynamic environments from four complementary perspectives: memory-centric experience evolution, orchestration-centric workflow evolution, trajectory-centric offline evolution, and exploration-centric online evolution. And three paradigms of environment evolution are identified, namely neural-driven, difficulty-driven, and scaling-driven approaches. At last, several promising future directions are discussed, including Environment-as-a-Service, Multi-agent Environments, and Neural-Symbolic Environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A survey that maps LLM agent environments into lifecycle stages and evolution paths but leaves its selection process and completeness unshown.


This paper is a survey that organizes existing work on environments for LLM-based agents around a four-stage lifecycle: modeling, synthesis, evaluation, and application. The core contribution is a set of categories—eight attributes, eight domains, two synthesis paradigms, four agent-evolution pathways, and three environment-evolution paradigms—plus some discussion of future directions like Environment-as-a-Service.

It does a decent job pulling representative examples together and sketching how environments have developed in different domains. The breakdown of synthesis into symbolic versus neural approaches, and the four pathways for agent evolution (memory-centric, orchestration-centric, trajectory-centric, exploration-centric), gives a workable way to talk about the literature. That framing is clearer than many ad-hoc lists.

The main weakness is that the claim of systematic coverage rests on unstated choices. The abstract gives no search strategy, inclusion rules, or temporal bounds, so it is impossible to tell whether the chosen representatives actually cover the field or simply reflect what the authors read. If major lines of work in embodied agents or multi-agent setups were left out, the lifecycle view does not organize the area so much as rearrange a subset of it. As a survey it also adds no new experiments, proofs, or datasets.

The paper is mainly useful to researchers already active in LLM agents who want a single document that points to a range of environments and papers. It is not the place to look for new methods or rigorous validation of any taxonomy.

I would send it to peer review. The topic is active and a well-documented map could help, but the authors need to show their literature process was thorough before the categories can be treated as reliable.

Referee Report

1 major / 0 minor

Summary. The paper claims to deliver a systematic survey of agentic environments for LLM-based agents, organized around the environment engineering lifecycle. It first categorizes representative environments via eight attributes and eight domains, analyzing their development paths and core capabilities; second, it presents two automated synthesis paradigms (symbolic and neural) together with associated evaluation methods; third, it examines applications through agent-environment co-evolution, characterizing four complementary evolution pathways (memory-centric experience evolution, orchestration-centric workflow evolution, trajectory-centric offline evolution, exploration-centric online evolution) and three environment-evolution paradigms (neural-driven, difficulty-driven, scaling-driven); finally, it outlines future directions including Environment-as-a-Service, Multi-agent Environments, and Neural-Symbolic Environments.

Significance. If the selected representatives and categorizations are shown to be comprehensive, the survey would provide a useful organizing framework for an emerging area by foregrounding the co-evolution of agents and environments. The lifecycle perspective and explicit enumeration of pathways and paradigms constitute a concrete contribution that could guide subsequent research; the paper earns credit for attempting to synthesize diverse lines of work into a structured lifecycle view rather than a simple enumeration.

major comments (1)

[Introduction / survey scope] The abstract asserts that the paper 'systematically studies current researches' and selects 'representative environments' across eight attributes, eight domains, two synthesis paradigms, four evolution pathways, and three environment-evolution paradigms, yet no description is given of the literature search strategy, databases, keywords, temporal scope, or inclusion/exclusion criteria. This is load-bearing for the central claim, because without such details it is impossible to verify whether the chosen categories constitute a comprehensive, unbiased view or a post-hoc organization that may omit major lines of work (e.g., certain embodied or multi-agent setups).

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the survey scope and methodology. We agree that explicit details on the literature search process are necessary to support the claim of a systematic survey and will revise the manuscript accordingly.


Referee: [Introduction / survey scope] The abstract asserts that the paper 'systematically studies current researches' and selects 'representative environments' across eight attributes, eight domains, two synthesis paradigms, four evolution pathways, and three environment-evolution paradigms, yet no description is given of the literature search strategy, databases, keywords, temporal scope, or inclusion/exclusion criteria. This is load-bearing for the central claim, because without such details it is impossible to verify whether the chosen categories constitute a comprehensive, unbiased view or a post-hoc organization that may omit major lines of work (e.g., certain embodied or multi-agent setups).

Authors: We acknowledge the validity of this observation. The current manuscript does not include an explicit description of the search strategy. In the revised version we will insert a new subsection (tentatively titled 'Survey Methodology') immediately after the Introduction that specifies: (1) databases queried (arXiv, ACL Anthology, NeurIPS/ICLR/ICML proceedings, and Google Scholar), (2) primary keywords and Boolean combinations (e.g., "LLM agent" AND (environment OR simulator OR benchmark)), (3) temporal scope (primarily 2022–2024 with selected foundational works), and (4) inclusion/exclusion criteria (interactive environments supporting LLM-based agents; exclusion of purely static datasets or non-agentic simulators). We will also add a brief limitations paragraph noting that, while the eight domains and attributes aim to capture the dominant research threads, certain specialized embodied or multi-agent configurations may receive lighter coverage and are flagged for future expansion. These additions will allow readers to assess the representativeness of the selected categories.

revision: yes
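
The Boolean query and temporal scope named in the rebuttal can be expressed as a simple screening filter. The sketch below is only an illustration of that stated rule: the `Paper` record, the substring matching, and the function name are assumptions, and it does not model the "selected foundational works" exception or the manual inclusion/exclusion judgment the authors describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    """Hypothetical metadata record for one candidate paper."""
    title: str
    abstract: str
    year: int


# Right-hand side of the query: environment OR simulator OR benchmark.
ENV_TERMS = ("environment", "simulator", "benchmark")


def matches_query(paper: Paper, start_year: int = 2022, end_year: int = 2024) -> bool:
    # Case-insensitive substring search over title and abstract.
    text = f"{paper.title} {paper.abstract}".lower()
    has_agent = "llm agent" in text
    has_env = any(term in text for term in ENV_TERMS)
    in_scope = start_year <= paper.year <= end_year
    # "LLM agent" AND (environment OR simulator OR benchmark), within the time window.
    return has_agent and has_env and in_scope
```

A keyword filter like this only produces a candidate pool; excluding static datasets or non-agentic simulators would still require reading each paper.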

Circularity Check

0 steps flagged

No circularity: survey organizes external literature


This is a literature survey paper with no mathematical derivations, equations, fitted parameters, predictions, or self-referential claims. The central content consists of categorizing and summarizing cited external works across modeling, synthesis, evaluation, and application. No load-bearing step reduces by construction to the paper's own inputs or self-citations. Selection of representatives is presented as a review of the field rather than a derived result, so concerns about coverage fall under completeness rather than circularity. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper that organizes existing literature on agentic environments for LLMs; it introduces no new free parameters, axioms, or invented entities beyond standard assumptions in AI literature review methodology.

pith-pipeline@v0.9.1-grok · 5805 in / 1170 out tokens · 14870 ms · 2026-06-27T09:45:02.026088+00:00 · methodology


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Empowering GUI Agents via Autonomous Experience Exploration and Hindsight Experience Utilization for Task Planning
cs.CL · 2026-06 · unverdicted · novelty 6.0

PEEU enables a 7B MLLM to reach 30.6% accuracy on GUI task planning by autonomous exploration and hindsight experience synthesis, outperforming a 32B model through stronger high-level OOD generalization.

Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It
cs.CL · 2026-06 · unverdicted · novelty 5.0

RL for LLM multi-step tool use collapses from control token probability spikes but interleaving SFT improves stability at the cost of OOD generalization.

Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do
cs.CL · 2026-06 · unverdicted · novelty 5.0

Multimodal CoT improves math and science reasoning but degrades visual perception performance and exhibits a pattern of diminishing visual reflection while verbal reflection varies.

Reference graph

Works this paper leans on

299 extracted references · 128 canonical work pages · cited by 3 Pith papers · 29 internal anchors

[1]

Qwen3 technical report,

Q. Team, “Qwen3 technical report,”CoRR, vol. abs/2505.09388,

Pith/arXiv arXiv
[2]

BrainOmni: A Brain Foundation Model for Unified EEG and MEG Signals

[Online]. Available: https://doi.org/10.48550/arXiv.2505. 09388

work page doi:10.48550/arxiv.2505
[3]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,

DeepSeek-AI, “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,”CoRR, vol. abs/2501.12948,

Pith/arXiv arXiv
[4]

Available: https://doi.org/10.48550/arXiv.2501

[Online]. Available: https://doi.org/10.48550/arXiv.2501. 12948

work page doi:10.48550/arxiv.2501
[5]

GEM: A gym for agentic llms,

Z. Liu, A. Sims, K. Duan, C. Chen, S. Yu, X. Zhou, H. Xu, S. Xiong, B. Liu, C. Tan, C. Y. Beh, W. Wang, H. Zhu, W. Shi, D. Yang, M. Shieh, Y. W. Teh, W. S. Lee, and M. Lin, “GEM: A gym for agentic llms,”CoRR, vol. abs/2510.01051, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2510.01051

work page doi:10.48550/arxiv.2510.01051 2025
[6]

Humanity’s last code exam: Can advanced llms conquer human’s hardest code competition?

X. Li, X. Li, K. Dong, Q. Zhang, R. Ruan, X. Dai, Y. Wang, and R. Tang, “Humanity’s last code exam: Can advanced llms conquer human’s hardest code competition?” inFindings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, November 4-9, 2025, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V . Peng, Eds. Association for Compu...

2025
[7]

(2025) Introducing gpt-5.4

OpenAI. (2025) Introducing gpt-5.4. Accessed: 2026- 03-16. [Online]. Available: https://openai.com/index/ introducing-gpt-5-4/

2025
[8]

(2025) Gemini 3.1 pro

Google DeepMind. (2025) Gemini 3.1 pro. Accessed: 2026-03-16. [Online]. Available: https://deepmind.google/models/gemini/ pro/

2025
[9]

Kimi K2.5: visual agentic intelligence,

K. Team, “Kimi K2.5: visual agentic intelligence,”CoRR, vol. abs/2602.02276, 2026. [Online]. Available: https://doi.org/10. 48550/arXiv.2602.02276

Pith/arXiv arXiv 2026
[10]

ToolRL: Reward is All Tool Learning Needs

C. Qian, E. C. Acikgoz, Q. He, H. Wang, X. Chen, D. Hakkani- Tür, G. Tur, and H. Ji, “Toolrl: Reward is all tool learning needs,”CoRR, vol. abs/2504.13958, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2504.13958

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2504.13958 2025
[11]

Travelplanner: A benchmark for real-world planning with language agents,

J. Xie, K. Zhang, J. Chen, T. Zhu, R. Lou, Y. Tian, Y. Xiao, and Y. Su, “Travelplanner: A benchmark for real-world planning with language agents,” inForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024, ser. Proceedings of Machine Learning Research, R. Salakhutdinov, Z. Kolter, K. A. Heller, A. Weller, N. ...

2024
[12]

Self-refine: Iterative refinement with self- feedback,

A. Madaan, N. Tandon, P . Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Gupta, B. P . Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, and P . Clark, “Self-refine: Iterative refinement with self- feedback,” inAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing...

2023
[13]

Agent world model: Infinity synthetic environments for agentic reinforcement learning,

Z. Wang, C. Xu, B. Liu, Y. Wang, S. Han, Z. Yao, H. Yao, and Y. He, “Agent world model: Infinity synthetic environments for agentic reinforcement learning,”arXiv preprint arXiv:2602.10090, 2026

Pith/arXiv arXiv 2026
[14]

Rwku: Benchmarking real-world knowledge unlearning for large language models,

P . Cao, C. Wang, Z. He, H. Yuan, J. Li, Y. Chen, K. Liu, J. Zhao et al., “Rwku: Benchmarking real-world knowledge unlearning for large language models,” vol. 37, 2024, pp. 98 213–98 263

2024
[15]

A trouble- maker with contagious jailbreak makes chaos in honest towns,

T. Men, P . Cao, Z. Jin, Y. Chen, K. Liu, and J. Zhao, “A trouble- maker with contagious jailbreak makes chaos in honest towns,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 17 561– 17 587

2025
[16]

Dacomp: Benchmarking data agents across the full data intelligence lifecycle,

F. Lei, J. Meng, Y. Huang, J. Zhao, Y. Zhang, J. Luo, X. Zou, R. Yang, W. Shi, Y. Gaoet al., “Dacomp: Benchmarking data agents across the full data intelligence lifecycle,”arXiv preprint arXiv:2512.04324, 2025

arXiv 2025
[17]

Webshop: Towards scalable real-world web interaction with grounded language agents,

S. Yao, H. Chen, J. Yang, and K. Narasimhan, “Webshop: Towards scalable real-world web interaction with grounded language agents,” inAdvances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, S. Koyejo, S. Mohamed, A. Agarwal, D....

2022
[18]

Swe-bench: Can language models resolve real-world github issues?

C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan, “Swe-bench: Can language models resolve real-world github issues?” inThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. [Online]. Available: https://openreview.net/forum?id=VTF8yNQM66

2024
[19]

React: Synergizing reasoning and acting in language models,

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” inThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. [Online]. Available: https: //openreview.net/forum?id=WE_vluYUL-X

2023
[20]

Fixing the broken compass: Diagnosing and improving inference-time reward modeling,

J. Li, P . Cao, Y. Chen, J. Xu, H. Li, X. Jiang, K. Liu, and J. Zhao, “Fixing the broken compass: Diagnosing and improving inference-time reward modeling,” 2025

2025
[21]

Omni-reward: Towards generalist omni-modal reward model- ing with free-form preferences,

Z. Jin, H. Yuan, K. Zhu, J. Li, P . Cao, Y. Chen, K. Liu, and J. Zhao, “Omni-reward: Towards generalist omni-modal reward model- ing with free-form preferences,”arXiv preprint arXiv:2510.23451, 2025

arXiv 2025
[22]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Z. Shao, P . Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. K. Li, Y. Wu, and D. Guo, “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,”CoRR, vol. abs/2402.03300, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2402.03300

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.03300 2024
[23]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, T. Fan, G. Liu, L. Liu, X. Liu, H. Lin, Z. Lin, B. Ma, G. Sheng, Y. Tong, C. Zhang, M. Zhang, W. Zhang, H. Zhu, J. Zhu, J. Chen, J. Chen, C. Wang, H. Yu, W. Dai, Y. Song, X. Wei, H. Zhou, J. Liu, W. Ma, Y. Zhang, L. Yan, M. Qiao, Y. Wu, and M. Wang, “DAPO: an open-source LLM reinforcement learning system a...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.14476 2025
[24]

Reinforced internal-external knowledge synergistic reasoning for efficient adaptive search agent,

Z. Huang, X. Yuan, Y. Ju, J. Zhao, and K. Liu, “Reinforced internal-external knowledge synergistic reasoning for efficient adaptive search agent,”arXiv preprint arXiv:2505.07596, 2025

arXiv 2025
[25]

Towards agentic self-learning llms in search environ- ment,

W. Sun, X. Cheng, J. Fan, Y. Xu, X. Yu, S. He, J. Zhao, and K. Liu, “Towards agentic self-learning llms in search environ- ment,”arXiv preprint arXiv:2510.14253, 2025

arXiv 2025
[26]

Agentic Reasoning for Large Language Models

T. Wei, T. Li, Z. Liu, X. Ning, Z. Yang, J. Zou, Z. Zeng, R. Qiu, X. Lin, D. Fu, Z. Li, M. Ai, D. Zhou, W. Bao, Y. Li, G. Li, C. Qian, Y. Wang, X. Tang, Y. Xiao, L. Fang, H. Liu, X. Tang, Y. Zhang, C. Wang, J. You, H. Ji, H. Tong, and J. He, “Agentic reasoning for large language models,”CoRR, vol. abs/2601.12538, 2026. [Online]. Available: https://doi.org...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.12538 2026
[27]

Large language models for plan- ning: A comprehensive and systematic survey,

P . Cao, T. Men, W. Liu, J. Zhang, X. Li, X. Lin, D. Sui, Y. Cao, K. Liu, and J. Zhao, “Large language models for plan- ning: A comprehensive and systematic survey,”arXiv preprint arXiv:2505.19683, 2025

arXiv 2025
[28]

A survey of recent advances in commonsense knowledge acquisition: Methods and resources,

C. Wang, J. Li, Y. Chen, K. Liu, and J. Zhao, “A survey of recent advances in commonsense knowledge acquisition: Methods and resources,”Machine Intelligence Research, vol. 22, no. 2, pp. 201– 218, 2025

2025
[29]

Workarena: How capable are web agents at solving common knowledge work tasks?

A. Drouin, M. Gasse, M. Caccia, I. H. Laradji, M. D. Verme, T. Marty, D. Vázquez, N. Chapados, and A. Lacoste, “Workarena: How capable are web agents at solving common knowledge work tasks?” inForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21- 27, 2024, ser. Proceedings of Machine Learning Research, R. Salakhutd...

2024
[30]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments,

T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, Y. Liu, Y. Xu, S. Zhou, S. Savarese, C. Xiong, V . Zhong, and T. Yu, “Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments,” inAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processin...

2024
[31]

Measuring short-form factuality in large language models,

J. Wei, N. Karina, H. W. Chung, Y. J. Jiao, S. Papay, A. Glaese, J. Schulman, and W. Fedus, “Measuring short-form factuality in large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2411.04368

Pith/arXiv arXiv 2024
[32]

GAIA: a benchmark for general AI assistants,

G. Mialon, C. Fourrier, T. Wolf, Y. LeCun, and T. Scialom, “GAIA: a benchmark for general AI assistants,” inThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. [Online]. Available: https://openreview.net/forum?id=fibxvahvs3

2024
[33]

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

J. Wei, Z. Sun, S. Papay, S. McKinney, J. Han, I. Fulford, H. W. Chung, A. T. Passos, W. Fedus, and A. Glaese, “Browsecomp: A simple yet challenging benchmark for browsing agents,”CoRR, vol. abs/2504.12516, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2504.12516

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2504.12516 2025
[34]

ALFRED: A benchmark for interpreting grounded instructions for everyday tasks,

M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox, “ALFRED: A benchmark for interpreting grounded instructions for everyday tasks,” in2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. Computer Vision Foundation / IEEE, 2020, pp. 10 737–10 746. [On...

2020
[35]

Alfworld: Aligning text and embodied environments for interactive learning,

M. Shridhar, X. Yuan, M. Côté, Y. Bisk, A. Trischler, and M. J. Hausknecht, “Alfworld: Aligning text and embodied environments for interactive learning,” in9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. [Online]. Available: https://openreview.net/forum?id=0IOX0YcCdTn

2021
[36]

Scienceworld: Is your agent smarter than a 5th grader?

R. Wang, P . A. Jansen, M. Côté, and P . Ammanabrolu, “Scienceworld: Is your agent smarter than a 5th grader?” inProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds. Association for Computational Linguistics, 20...

work page doi:10.18653/v1/2022.emnlp-main.775 2022
[37]

Gamearena: Evaluating llm reasoning through live computer games,

L. Hu, Q. Li, A. Xie, N. Jiang, I. Stoica, H. Jin, and H. Zhang, “Gamearena: Evaluating llm reasoning through live computer games,” 2025. [Online]. Available: https: //arxiv.org/abs/2412.06394

arXiv 2025
[38]

Baba is ai: Break the rules to beat the benchmark,

N. Cloos, M. Jens, M. Naim, Y.-L. Kuo, I. Cases, A. Barbu, and C. J. Cueva, “Baba is ai: Break the rules to beat the benchmark,”
[39]

Available: https://arxiv.org/abs/2407.13729

[Online]. Available: https://arxiv.org/abs/2407.13729

arXiv
[40]

Gamebench: Evaluating strategic reasoning abilities of llm agents,

A. Costarelli, M. Allen, R. Hauksson, G. Sodunke, S. Hariharan, C. Cheng, W. Li, J. Clymer, and A. Yadav, “Gamebench: Evaluating strategic reasoning abilities of llm agents,” 2024. [Online]. Available: https://arxiv.org/abs/2406.06613

arXiv 2024
[41]

Toolllm: Facilitating large language models to master 16000+ real-world apis,

Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, L. Hong, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun, “Toolllm: Facilitating large language models to master 16000+ real-world apis,” inThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7- 11, 202...

2024
[42]

$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

S. Yao, N. Shinn, P . Razavi, and K. Narasimhan, “τ-bench: A benchmark for tool-agent-user interaction in real-world domains,”CoRR, vol. abs/2406.12045, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2406.12045

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2406.12045 2024
[43]

Api-bank: A comprehensive benchmark for tool-augmented llms,

M. Li, Y. Zhao, B. Yu, F. Song, H. Li, H. Yu, Z. Li, F. Huang, and Y. Li, “Api-bank: A comprehensive benchmark for tool-augmented llms,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, H. Bouamor, J. Pino, and K. Bali, Eds. Association for Computational Linguistics, 2023...
[44]

Program synthesis with large language models,

J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le et al., “Program synthesis with large language models,” arXiv preprint arXiv:2108.07732, 2021
[45]

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

M. A. Merrill, A. G. Shaw, N. Carlini, B. Li, H. Raj, I. Bercovich, L. Shi, J. Y. Shin, T. Walshe, E. K. Buchanan, J. Shen, G. Ye, H. Lin, J. Poulos, M. Wang, M. Nezhurina, J. Jitsev, D. Lu, O. Menis-Mastromichalakis, Z. Xu, Z. Chen, Y. Liu, R. Zhang, L. L. Chen, A. Kashyap, J. Uslu, J. Li, J. Wu, M. Yan, S. Bian, V. Sharma, K. Sun, S. Dillmann, A. Anand...

doi:10.48550/arxiv.2601.11868
[46]

Kernelbench: Can llms write efficient gpu kernels?

A. Ouyang, S. Guo, S. Arora, A. L. Zhang, W. Hu, C. Ré, and A. Mirhoseini, “Kernelbench: Can llms write efficient gpu kernels?” arXiv preprint arXiv:2502.10517, 2025
[47]

Medagentbench: Dataset for benchmarking llms as agents in medical applications,

Y. Jiang, K. C. Black, G. Geng, D. Park, A. Y. Ng, and J. H. Chen, “Medagentbench: Dataset for benchmarking llms as agents in medical applications,” CoRR, vol. abs/2501.14654, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2501.14654
[48]

Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery,

Z. Chen, S. Chen, Y. Ning, Q. Zhang, B. Wang, B. Yu, Y. Li, Z. Liao, C. Wei, Z. Lu, V. Dey, M. Xue, F. N. Baker, B. Burns, D. Adu-Ampratwum, X. Huang, X. Ning, S. Gao, Y. Su, and H. Sun, “Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery,” in The Thirteenth International Conference on Learning Represent...
[49]

Dsbench: How far are data science agents from becoming data science experts?

L. Jing, Z. Huang, X. Wang, W. Yao, W. Yu, K. Ma, H. Zhang, X. Du, and D. Yu, “Dsbench: How far are data science agents from becoming data science experts?” in The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. [Online]. Available: https://openreview.net/forum?id=DSsSPr0RZJ
[50]

Openai gym,

G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016
[51]

Agentbench: Evaluating llms as agents,

X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, S. Zhang, X. Deng, A. Zeng, Z. Du, C. Zhang, S. Shen, T. Zhang, Y. Su, H. Sun, M. Huang, Y. Dong, and J. Tang, “Agentbench: Evaluating llms as agents,” in The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.n...
[52]

Agentscourt: Building judicial decision-making agents with court debate simulation and legal knowledge augmentation,

Z. He, P. Cao, C. Wang, Z. Jin, Y. Chen, J. Xu, H. Li, K. Liu, and J. Zhao, “Agentscourt: Building judicial decision-making agents with court debate simulation and legal knowledge augmentation,” in Findings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 9399–9416
[53]

Mmr-v: What’s left unsaid? a benchmark for multimodal deep reasoning in videos,

K. Zhu, Z. Jin, H. Yuan, J. Li, S. Tu, P. Cao, Y. Chen, K. Liu, and J. Zhao, “Mmr-v: What’s left unsaid? a benchmark for multimodal deep reasoning in videos,” arXiv preprint arXiv:2506.04141, 2025
[54]

Mmr-life: Piecing together real-life scenes for multimodal multi-image reasoning,

J. Li, S. Huang, Z. Jin, C. Zhang, P. Cao, Y. Chen, K. Liu, and J. Zhao, “Mmr-life: Piecing together real-life scenes for multimodal multi-image reasoning,” arXiv preprint arXiv:2603.02024, 2026
[55]

Windows agent arena: Evaluating multi-modal os agents at scale,

R. Bonatti, D. Zhao, F. Bonacci, D. Dupont, S. Abdali, Y. Li, Y. Lu, J. Wagle, K. Koishida, A. Bucker et al., “Windows agent arena: Evaluating multi-modal os agents at scale,” arXiv preprint arXiv:2409.08264, 2024.
[56]

Osworld-mcp: Benchmarking MCP tool invocation in computer-use agents,

H. Jia, J. Liao, X. Zhang, H. Xu, T. Xie, C. Jiang, M. Yan, S. Liu, W. Ye, and F. Huang, “Osworld-mcp: Benchmarking MCP tool invocation in computer-use agents,” CoRR, vol. abs/2510.24563, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2510.24563
[57]

Mobile-env: Building qualified evaluation benchmarks for llm-gui interaction,

D. Zhang, Z. Shen, R. Xie, S. Zhang, T. Xie, Z. Zhao, S. Chen, L. Chen, H. Xu, R. Cao et al., “Mobile-env: Building qualified evaluation benchmarks for llm-gui interaction,” arXiv preprint arXiv:2305.08144, 2023
[58]

Android in the wild: A large-scale dataset for android device control,

C. Rawles, A. Li, D. Rodriguez, O. Riva, and T. P. Lillicrap, “Android in the wild: A large-scale dataset for android device control,” CoRR, vol. abs/2307.10088, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2307.10088
[59]

Androidworld: A dynamic benchmarking environment for autonomous agents,

C. Rawles, S. Clinckemaillie, Y. Chang, J. Waltz, G. Lau, M. Fair, A. Li, W. E. Bishop, W. Li, F. Campbell-Ajala, D. K. Toyama, R. J. Berry, D. Tyamagundlu, T. P. Lillicrap, and O. Riva, “Androidworld: A dynamic benchmarking environment for autonomous agents,” in The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, Ap...
[60]

Mobileworld: Benchmarking autonomous mobile agents in agent-user interactive and mcp-augmented environments,

Q. Kong, X. Zhang, Z. Yang, N. Gao, C. Liu, P. Tong, C. Cai, H. Zhou, J. Zhang, L. Chen, Z. Liu, S. Hoi, and Y. Wang, “Mobileworld: Benchmarking autonomous mobile agents in agent-user interactive and mcp-augmented environments,” CoRR, vol. abs/2512.19432, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2512.19432
[61]

Mobile-bench: An evaluation benchmark for llm-based mobile agents,

S. Deng, W. Xu, H. Sun, W. Liu, T. Tan, L. Liujianfeng, A. Li, J. Luan, B. Wang, R. Yan et al., “Mobile-bench: An evaluation benchmark for llm-based mobile agents,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 8813–8831
[62]

Mind2web: Towards a generalist agent for the web,

X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su, “Mind2web: Towards a generalist agent for the web,” in Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Ha...
[63]

Webarena: A realistic web environment for building autonomous agents,

S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried, U. Alon, and G. Neubig, “Webarena: A realistic web environment for building autonomous agents,” in The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. [Online]. Available: https://openreview.net/forum?id=oKn9c6ytLx
[65]

Visualwebarena: Evaluating multimodal agents on realistic visual web tasks,

J. Y. Koh, R. Lo, L. Jang, V. Duvvur, M. C. Lim, P. Huang, G. Neubig, S. Zhou, R. Salakhutdinov, and D. Fried, “Visualwebarena: Evaluating multimodal agents on realistic visual web tasks,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, L....

doi:10.18653/v1/2024.acl-long.50
[66]

Widesearch: Benchmarking agentic broad info-seeking,

R. Wong, J. Wang, J. Zhao, L. Chen, Y. Gao, L. Zhang, X. Zhou, Z. Wang, K. Xiang, G. Zhang, W. Huang, Y. Wang, and K. Wang, “Widesearch: Benchmarking agentic broad info-seeking,” CoRR, vol. abs/2508.07999, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2508.07999
[67]

Infodeepseek: Benchmarking agentic information seeking for retrieval-augmented generation,

Y. Xi, J. Lin, M. Zhu, Y. Xiao, Z. Ou, J. Liu, T. Wan, B. Chen, W. Liu, Y. Wang, R. Tang, W. Zhang, and Y. Yu, “Infodeepseek: Benchmarking agentic information seeking for retrieval-augmented generation,” CoRR, vol. abs/2505.15872, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2505.15872
[68]

Open data synthesis for deep research,

Z. Xia, K. Luo, H. Qian, and Z. Liu, “Open data synthesis for deep research,” CoRR, vol. abs/2509.00375, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2509.00375
[69]

Mmdeepresearch-bench: A benchmark for multimodal deep research agents,

P. Huang, Z. Zhong, Z. Wan, D. Zhou, S. Alam, X. Wang, Z. Li, Z. Dou, L. Zhu, J. Xiong, C. Tao, Y. Xu, D. Dimitriadis, T. Zhang, and M. Zhang, “Mmdeepresearch-bench: A benchmark for multimodal deep research agents,” CoRR, vol. abs/2601.12346, [Online]. Available: https://doi.org/10.48550/arXiv.2601.12346
[71]

Webwalker: Benchmarking llms in web traversal,

J. Wu, W. Yin, Y. Jiang, Z. Wang, Z. Xi, R. Fang, L. Zhang, Y. He, D. Zhou, P. Xie, and F. Huang, “Webwalker: Benchmarking llms in web traversal,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, W. Che, J. Nabende, E. Shutova, and M. T....
[72]

Dragged into conflicts: Detecting and addressing conflicting sources in search-augmented llms,

A. Cattan, A. Jacovi, O. Ram, J. Herzig, R. Aharoni, S. Goldshtein, E. Ofek, I. Szpektor, and A. Caciularu, “Dragged into conflicts: Detecting and addressing conflicting sources in search-augmented llms,” CoRR, vol. abs/2506.08500, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2506.08500
[73]

BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

P. Zhou, B. Leon, X. Ying, C. Zhang, Y. Shao, Q. Ye, D. Chong, Z. Jin, C. Xie, M. Cao, Y. Gu, S. Hong, J. Ren, J. Chen, C. Liu, and Y. Hua, “Browsecomp-zh: Benchmarking web browsing ability of large language models in chinese,” CoRR, vol. abs/2504.19314, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2504.19314
[74]

Characterizing deep research: A benchmark and formal definition,

A. Java, A. Khandelwal, S. P. Midigeshi, A. Halfaker, A. Deshpande, N. Goyal, A. Gupta, N. Natarajan, and A. Sharma, “Characterizing deep research: A benchmark and formal definition,” CoRR, vol. abs/2508.04183, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2508.04183
[75]

Omnigaia: Towards native omni-modal AI agents,

X. Li, W. Jiao, J. Jin, S. Wang, G. Dong, J. Jin, H. Wang, Y. Wang, J. Wen, Y. Lu, and Z. Dou, “Omnigaia: Towards native omni-modal AI agents,” CoRR, vol. abs/2602.22897, 2026. [Online]. Available: https://doi.org/10.48550/arXiv.2602.22897
[76]

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

M. Du, B. Xu, C. Zhu, X. Wang, and Z. Mao, “Deepresearch bench: A comprehensive benchmark for deep research agents,” CoRR, vol. abs/2506.11763, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2506.11763
[77]

Multimodal deepresearcher: Generating text-chart interleaved reports from scratch with agentic framework,

Z. Yang, B. Pan, H. Wang, Y. Wang, X. Liu, M. Zhu, B. Zhang, and W. Chen, “Multimodal deepresearcher: Generating text-chart interleaved reports from scratch with agentic framework,” CoRR, vol. abs/2506.02454, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2506.02454
[78]

Dr. bench: A multidimensional evaluation for deep research agents, from answers to reports,

Y. Yao, Y. Wang, Y. Zhang, Y. Lu, T. Gu, L. Li, D. Zhao, K. Wu, H. Wang, P. Nie, Y. Teng, and Y. Wang, “Dr. bench: A multidimensional evaluation for deep research agents, from answers to reports,” 2026. [Online]. Available: https://arxiv.org/abs/2510.02190
[79]

Surveygen: Quality-aware scientific survey generation with large language models,

T. Bao, M. T. Nayeem, D. Rafiei, and C. Zhang, “Surveygen: Quality-aware scientific survey generation with large language models,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4-9, 2025, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, Eds. Association for Comput...

doi:10.18653/v1/2025.emnlp-main.136
[80]

Reportbench: Evaluating deep research agents via academic survey tasks,

M. Li, Y. Zeng, Z. Cheng, C. Ma, and K. Jia, “Reportbench: Evaluating deep research agents via academic survey tasks,” CoRR, vol. abs/2508.15804, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2508.15804

Showing first 80 references.

[1] [1]

Qwen3 technical report,

Q. Team, “Qwen3 technical report,”CoRR, vol. abs/2505.09388,

Pith/arXiv arXiv

[2] [2]

BrainOmni: A Brain Foundation Model for Unified EEG and MEG Signals

[Online]. Available: https://doi.org/10.48550/arXiv.2505. 09388

work page doi:10.48550/arxiv.2505

[3] [3]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,

DeepSeek-AI, “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,”CoRR, vol. abs/2501.12948,

Pith/arXiv arXiv

[4] [4]

Available: https://doi.org/10.48550/arXiv.2501

[Online]. Available: https://doi.org/10.48550/arXiv.2501. 12948

work page doi:10.48550/arxiv.2501

[5] [5]

GEM: A gym for agentic llms,

Z. Liu, A. Sims, K. Duan, C. Chen, S. Yu, X. Zhou, H. Xu, S. Xiong, B. Liu, C. Tan, C. Y. Beh, W. Wang, H. Zhu, W. Shi, D. Yang, M. Shieh, Y. W. Teh, W. S. Lee, and M. Lin, “GEM: A gym for agentic llms,”CoRR, vol. abs/2510.01051, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2510.01051

work page doi:10.48550/arxiv.2510.01051 2025

[6] [6]

Humanity’s last code exam: Can advanced llms conquer human’s hardest code competition?

X. Li, X. Li, K. Dong, Q. Zhang, R. Ruan, X. Dai, Y. Wang, and R. Tang, “Humanity’s last code exam: Can advanced llms conquer human’s hardest code competition?” inFindings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, November 4-9, 2025, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V . Peng, Eds. Association for Compu...

2025

[7] [7]

(2025) Introducing gpt-5.4

OpenAI. (2025) Introducing gpt-5.4. Accessed: 2026- 03-16. [Online]. Available: https://openai.com/index/ introducing-gpt-5-4/

2025

[8] [8]

(2025) Gemini 3.1 pro

Google DeepMind. (2025) Gemini 3.1 pro. Accessed: 2026-03-16. [Online]. Available: https://deepmind.google/models/gemini/ pro/

2025

[9] [9]

Kimi K2.5: visual agentic intelligence,

K. Team, “Kimi K2.5: visual agentic intelligence,”CoRR, vol. abs/2602.02276, 2026. [Online]. Available: https://doi.org/10. 48550/arXiv.2602.02276

Pith/arXiv arXiv 2026

[10] [10]

ToolRL: Reward is All Tool Learning Needs

C. Qian, E. C. Acikgoz, Q. He, H. Wang, X. Chen, D. Hakkani- Tür, G. Tur, and H. Ji, “Toolrl: Reward is all tool learning needs,”CoRR, vol. abs/2504.13958, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2504.13958

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2504.13958 2025

[11] [11]

Travelplanner: A benchmark for real-world planning with language agents,

J. Xie, K. Zhang, J. Chen, T. Zhu, R. Lou, Y. Tian, Y. Xiao, and Y. Su, “Travelplanner: A benchmark for real-world planning with language agents,” inForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024, ser. Proceedings of Machine Learning Research, R. Salakhutdinov, Z. Kolter, K. A. Heller, A. Weller, N. ...

2024

[12] [12]

Self-refine: Iterative refinement with self- feedback,

A. Madaan, N. Tandon, P . Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Gupta, B. P . Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, and P . Clark, “Self-refine: Iterative refinement with self- feedback,” inAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing...

2023

[13] [13]

Agent world model: Infinity synthetic environments for agentic reinforcement learning,

Z. Wang, C. Xu, B. Liu, Y. Wang, S. Han, Z. Yao, H. Yao, and Y. He, “Agent world model: Infinity synthetic environments for agentic reinforcement learning,”arXiv preprint arXiv:2602.10090, 2026

Pith/arXiv arXiv 2026

[14] [14]

Rwku: Benchmarking real-world knowledge unlearning for large language models,

P . Cao, C. Wang, Z. He, H. Yuan, J. Li, Y. Chen, K. Liu, J. Zhao et al., “Rwku: Benchmarking real-world knowledge unlearning for large language models,” vol. 37, 2024, pp. 98 213–98 263

2024

[15] [15]

A trouble- maker with contagious jailbreak makes chaos in honest towns,

T. Men, P . Cao, Z. Jin, Y. Chen, K. Liu, and J. Zhao, “A trouble- maker with contagious jailbreak makes chaos in honest towns,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 17 561– 17 587

2025

[16] [16]

Dacomp: Benchmarking data agents across the full data intelligence lifecycle,

F. Lei, J. Meng, Y. Huang, J. Zhao, Y. Zhang, J. Luo, X. Zou, R. Yang, W. Shi, Y. Gaoet al., “Dacomp: Benchmarking data agents across the full data intelligence lifecycle,”arXiv preprint arXiv:2512.04324, 2025

arXiv 2025

[17] [17]

Webshop: Towards scalable real-world web interaction with grounded language agents,

S. Yao, H. Chen, J. Yang, and K. Narasimhan, “Webshop: Towards scalable real-world web interaction with grounded language agents,” inAdvances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, S. Koyejo, S. Mohamed, A. Agarwal, D....

2022

[18] [18]

Swe-bench: Can language models resolve real-world github issues?

C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan, “Swe-bench: Can language models resolve real-world github issues?” inThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. [Online]. Available: https://openreview.net/forum?id=VTF8yNQM66

2024

[19] [19]

React: Synergizing reasoning and acting in language models,

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” inThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. [Online]. Available: https: //openreview.net/forum?id=WE_vluYUL-X

2023

[20] [20]

Fixing the broken compass: Diagnosing and improving inference-time reward modeling,

J. Li, P . Cao, Y. Chen, J. Xu, H. Li, X. Jiang, K. Liu, and J. Zhao, “Fixing the broken compass: Diagnosing and improving inference-time reward modeling,” 2025

2025

[21] [21]

Omni-reward: Towards generalist omni-modal reward model- ing with free-form preferences,

Z. Jin, H. Yuan, K. Zhu, J. Li, P . Cao, Y. Chen, K. Liu, and J. Zhao, “Omni-reward: Towards generalist omni-modal reward model- ing with free-form preferences,”arXiv preprint arXiv:2510.23451, 2025

arXiv 2025

[22] [22]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Z. Shao, P . Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. K. Li, Y. Wu, and D. Guo, “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,”CoRR, vol. abs/2402.03300, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2402.03300

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.03300 2024

[23] [23]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, T. Fan, G. Liu, L. Liu, X. Liu, H. Lin, Z. Lin, B. Ma, G. Sheng, Y. Tong, C. Zhang, M. Zhang, W. Zhang, H. Zhu, J. Zhu, J. Chen, J. Chen, C. Wang, H. Yu, W. Dai, Y. Song, X. Wei, H. Zhou, J. Liu, W. Ma, Y. Zhang, L. Yan, M. Qiao, Y. Wu, and M. Wang, “DAPO: an open-source LLM reinforcement learning system a...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.14476 2025

[24] [24]

Reinforced internal-external knowledge synergistic reasoning for efficient adaptive search agent,

Z. Huang, X. Yuan, Y. Ju, J. Zhao, and K. Liu, “Reinforced internal-external knowledge synergistic reasoning for efficient adaptive search agent,”arXiv preprint arXiv:2505.07596, 2025

arXiv 2025

[25] [25]

Towards agentic self-learning llms in search environ- ment,

W. Sun, X. Cheng, J. Fan, Y. Xu, X. Yu, S. He, J. Zhao, and K. Liu, “Towards agentic self-learning llms in search environ- ment,”arXiv preprint arXiv:2510.14253, 2025

arXiv 2025

[26] [26]

Agentic Reasoning for Large Language Models

T. Wei, T. Li, Z. Liu, X. Ning, Z. Yang, J. Zou, Z. Zeng, R. Qiu, X. Lin, D. Fu, Z. Li, M. Ai, D. Zhou, W. Bao, Y. Li, G. Li, C. Qian, Y. Wang, X. Tang, Y. Xiao, L. Fang, H. Liu, X. Tang, Y. Zhang, C. Wang, J. You, H. Ji, H. Tong, and J. He, “Agentic reasoning for large language models,”CoRR, vol. abs/2601.12538, 2026. [Online]. Available: https://doi.org...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.12538 2026

[27] [27]

Large language models for plan- ning: A comprehensive and systematic survey,

P . Cao, T. Men, W. Liu, J. Zhang, X. Li, X. Lin, D. Sui, Y. Cao, K. Liu, and J. Zhao, “Large language models for plan- ning: A comprehensive and systematic survey,”arXiv preprint arXiv:2505.19683, 2025

arXiv 2025

[28] [28]

A survey of recent advances in commonsense knowledge acquisition: Methods and resources,

C. Wang, J. Li, Y. Chen, K. Liu, and J. Zhao, “A survey of recent advances in commonsense knowledge acquisition: Methods and resources,”Machine Intelligence Research, vol. 22, no. 2, pp. 201– 218, 2025

2025

[29] [29]

Workarena: How capable are web agents at solving common knowledge work tasks?

A. Drouin, M. Gasse, M. Caccia, I. H. Laradji, M. D. Verme, T. Marty, D. Vázquez, N. Chapados, and A. Lacoste, “Workarena: How capable are web agents at solving common knowledge work tasks?” inForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21- 27, 2024, ser. Proceedings of Machine Learning Research, R. Salakhutd...

2024

[30] [30]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments,

T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, Y. Liu, Y. Xu, S. Zhou, S. Savarese, C. Xiong, V . Zhong, and T. Yu, “Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments,” inAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processin...

2024

[31] [31]

Measuring short-form factuality in large language models,

J. Wei, N. Karina, H. W. Chung, Y. J. Jiao, S. Papay, A. Glaese, J. Schulman, and W. Fedus, “Measuring short-form factuality in large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2411.04368

Pith/arXiv arXiv 2024

[32] [32]

GAIA: a benchmark for general AI assistants,

G. Mialon, C. Fourrier, T. Wolf, Y. LeCun, and T. Scialom, “GAIA: a benchmark for general AI assistants,” inThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. [Online]. Available: https://openreview.net/forum?id=fibxvahvs3

2024

[33] [33]

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

J. Wei, Z. Sun, S. Papay, S. McKinney, J. Han, I. Fulford, H. W. Chung, A. T. Passos, W. Fedus, and A. Glaese, “Browsecomp: A simple yet challenging benchmark for browsing agents,”CoRR, vol. abs/2504.12516, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2504.12516

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2504.12516 2025

[34] [34]

ALFRED: A benchmark for interpreting grounded instructions for everyday tasks,

M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox, “ALFRED: A benchmark for interpreting grounded instructions for everyday tasks,” in2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. Computer Vision Foundation / IEEE, 2020, pp. 10 737–10 746. [On...

2020

[35] [35]

Alfworld: Aligning text and embodied environments for interactive learning,

M. Shridhar, X. Yuan, M. Côté, Y. Bisk, A. Trischler, and M. J. Hausknecht, “Alfworld: Aligning text and embodied environments for interactive learning,” in9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. [Online]. Available: https://openreview.net/forum?id=0IOX0YcCdTn

2021

[36] [36]

Scienceworld: Is your agent smarter than a 5th grader?

R. Wang, P . A. Jansen, M. Côté, and P . Ammanabrolu, “Scienceworld: Is your agent smarter than a 5th grader?” inProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds. Association for Computational Linguistics, 20...

work page doi:10.18653/v1/2022.emnlp-main.775 2022

[37] [37]

Gamearena: Evaluating llm reasoning through live computer games,

L. Hu, Q. Li, A. Xie, N. Jiang, I. Stoica, H. Jin, and H. Zhang, “Gamearena: Evaluating llm reasoning through live computer games,” 2025. [Online]. Available: https: //arxiv.org/abs/2412.06394

arXiv 2025

[38] [38]

Baba is ai: Break the rules to beat the benchmark,

N. Cloos, M. Jens, M. Naim, Y.-L. Kuo, I. Cases, A. Barbu, and C. J. Cueva, “Baba is ai: Break the rules to beat the benchmark,”

[39] [39]

Available: https://arxiv.org/abs/2407.13729

[Online]. Available: https://arxiv.org/abs/2407.13729

arXiv

[40] [40]

Gamebench: Evaluating strategic reasoning abilities of llm agents,

A. Costarelli, M. Allen, R. Hauksson, G. Sodunke, S. Hariharan, C. Cheng, W. Li, J. Clymer, and A. Yadav, “Gamebench: Evaluating strategic reasoning abilities of llm agents,” 2024. [Online]. Available: https://arxiv.org/abs/2406.06613

arXiv 2024

[41] [41]

Toolllm: Facilitating large language models to master 16000+ real-world apis,

Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, L. Hong, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun, “Toolllm: Facilitating large language models to master 16000+ real-world apis,” inThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7- 11, 202...

2024

[42] [42]

$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

S. Yao, N. Shinn, P . Razavi, and K. Narasimhan, “τ-bench: A benchmark for tool-agent-user interaction in real-world domains,”CoRR, vol. abs/2406.12045, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2406.12045

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2406.12045 2024

[43] [43]

Api-bank: A comprehensive benchmark for tool-augmented llms,

M. Li, Y. Zhao, B. Yu, F. Song, H. Li, H. Yu, Z. Li, F. Huang, and Y. Li, “Api-bank: A comprehensive benchmark for tool-augmented llms,” inProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, H. Bouamor, J. Pino, and K. Bali, Eds. Association for Computational Linguistics, 2023...

2023

[44] [44]

Program synthesis with large language models,

J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Do- han, E. Jiang, C. Cai, M. Terry, Q. Leet al., “Program synthesis with large language models,”arXiv preprint arXiv:2108.07732, 2021

Pith/arXiv arXiv 2021

[45] [45]

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

M. A. Merrill, A. G. Shaw, N. Carlini, B. Li, H. Raj, I. Bercovich, L. Shi, J. Y. Shin, T. Walshe, E. K. Buchanan, J. Shen, G. Ye, H. Lin, J. Poulos, M. Wang, M. Nezhurina, J. Jitsev, D. Lu, O. Menis-Mastromichalakis, Z. Xu, Z. Chen, Y. Liu, R. Zhang, L. L. Chen, A. Kashyap, J. Uslu, J. Li, J. Wu, M. Yan, S. Bian, V . Sharma, K. Sun, S. Dillmann, A. Anand...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.11868 2026

[46] [46]

Kernelbench: Can llms write efficient gpu kernels?

A. Ouyang, S. Guo, S. Arora, A. L. Zhang, W. Hu, C. Ré, and A. Mirhoseini, “Kernelbench: Can llms write efficient gpu kernels?”arXiv preprint arXiv:2502.10517, 2025

Pith/arXiv arXiv 2025

[47] [47]

Medagentbench: Dataset for benchmarking llms as agents in medical applications,

Y. Jiang, K. C. Black, G. Geng, D. Park, A. Y. Ng, and J. H. Chen, “Medagentbench: Dataset for benchmarking llms as agents in medical applications,”CoRR, vol. abs/2501.14654, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2501.14654

work page doi:10.48550/arxiv.2501.14654 2025

[48] [48]

Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery,

Z. Chen, S. Chen, Y. Ning, Q. Zhang, B. Wang, B. Yu, Y. Li, Z. Liao, C. Wei, Z. Lu, V . Dey, M. Xue, F. N. Baker, B. Burns, D. Adu- Ampratwum, X. Huang, X. Ning, S. Gao, Y. Su, and H. Sun, “Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery,” inThe Thirteenth International Conference on Learning Represent...

2025

[49] [49]

Dsbench: How far are data science agents from becoming data science experts?

L. Jing, Z. Huang, X. Wang, W. Yao, W. Yu, K. Ma, H. Zhang, X. Du, and D. Yu, “Dsbench: How far are data science agents from becoming data science experts?” inThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. [Online]. Available: https://openreview.net/forum?id=DSsSPr0RZJ

2025

[50] [50]

Openai gym,

G. Brockman, V . Cheung, L. Pettersson, J. Schneider, J. Schul- man, J. Tang, and W. Zaremba, “Openai gym,”arXiv preprint arXiv:1606.01540, 2016

Pith/arXiv arXiv 2016

[51] [51]

Agentbench: Evaluating llms as agents,

X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, S. Zhang, X. Deng, A. Zeng, Z. Du, C. Zhang, S. Shen, T. Zhang, Y. Su, H. Sun, M. Huang, Y. Dong, and J. Tang, “Agentbench: Evaluating llms as agents,” inThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.n...

2024

[52] [52]

Agentscourt: Building judicial decision-making agents with court debate simulation and legal knowledge augmenta- tion,

Z. He, P . Cao, C. Wang, Z. Jin, Y. Chen, J. Xu, H. Li, K. Liu, and J. Zhao, “Agentscourt: Building judicial decision-making agents with court debate simulation and legal knowledge augmenta- tion,” inFindings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 9399–9416

2024

[53] [53]

Mmr-v: What’s left unsaid? a benchmark for multimodal deep reasoning in videos,

K. Zhu, Z. Jin, H. Yuan, J. Li, S. Tu, P. Cao, Y. Chen, K. Liu, and J. Zhao, “Mmr-v: What’s left unsaid? a benchmark for multimodal deep reasoning in videos,” arXiv preprint arXiv:2506.04141, 2025

arXiv 2025

[54] [54]

Mmr-life: Piecing together real-life scenes for multimodal multi-image reasoning,

J. Li, S. Huang, Z. Jin, C. Zhang, P. Cao, Y. Chen, K. Liu, and J. Zhao, “Mmr-life: Piecing together real-life scenes for multimodal multi-image reasoning,” arXiv preprint arXiv:2603.02024, 2026

arXiv 2026

[55] [55]

Windows agent arena: Evaluating multi-modal os agents at scale,

R. Bonatti, D. Zhao, F. Bonacci, D. Dupont, S. Abdali, Y. Li, Y. Lu, J. Wagle, K. Koishida, A. Bucker et al., “Windows agent arena: Evaluating multi-modal os agents at scale,” arXiv preprint arXiv:2409.08264, 2024.

arXiv 2024

[56] [56]

Osworld-mcp: Benchmarking MCP tool invocation in computer-use agents,

H. Jia, J. Liao, X. Zhang, H. Xu, T. Xie, C. Jiang, M. Yan, S. Liu, W. Ye, and F. Huang, “Osworld-mcp: Benchmarking MCP tool invocation in computer-use agents,” CoRR, vol. abs/2510.24563, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2510.24563

work page doi:10.48550/arxiv.2510.24563 2025

[57] [57]

Mobile-env: Building qualified evaluation benchmarks for llm-gui interaction,

D. Zhang, Z. Shen, R. Xie, S. Zhang, T. Xie, Z. Zhao, S. Chen, L. Chen, H. Xu, R. Cao et al., “Mobile-env: Building qualified evaluation benchmarks for llm-gui interaction,” arXiv preprint arXiv:2305.08144, 2023

arXiv 2023

[58] [58]

Android in the wild: A large-scale dataset for android device control,

C. Rawles, A. Li, D. Rodriguez, O. Riva, and T. P. Lillicrap, “Android in the wild: A large-scale dataset for android device control,” CoRR, vol. abs/2307.10088, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2307.10088

work page doi:10.48550/arxiv.2307.10088 2023

[59] [59]

Androidworld: A dynamic benchmarking environment for autonomous agents,

C. Rawles, S. Clinckemaillie, Y. Chang, J. Waltz, G. Lau, M. Fair, A. Li, W. E. Bishop, W. Li, F. Campbell-Ajala, D. K. Toyama, R. J. Berry, D. Tyamagundlu, T. P. Lillicrap, and O. Riva, “Androidworld: A dynamic benchmarking environment for autonomous agents,” in The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, Ap...

2025

[60] [60]

Mobileworld: Benchmarking autonomous mobile agents in agent-user interactive and mcp-augmented environments,

Q. Kong, X. Zhang, Z. Yang, N. Gao, C. Liu, P. Tong, C. Cai, H. Zhou, J. Zhang, L. Chen, Z. Liu, S. Hoi, and Y. Wang, “Mobileworld: Benchmarking autonomous mobile agents in agent-user interactive and mcp-augmented environments,” CoRR, vol. abs/2512.19432, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2512.19432

work page doi:10.48550/arxiv.2512.19432 2025

[61] [61]

Mobile-bench: An evaluation benchmark for llm-based mobile agents,

S. Deng, W. Xu, H. Sun, W. Liu, T. Tan, L. Liujianfeng, A. Li, J. Luan, B. Wang, R. Yan et al., “Mobile-bench: An evaluation benchmark for llm-based mobile agents,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 8813–8831

2024

[62] [62]

Mind2web: Towards a generalist agent for the web,

X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su, “Mind2web: Towards a generalist agent for the web,” in Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Ha...

2023

[63] [63]

Webarena: A realistic web environment for building autonomous agents,

S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried, U. Alon, and G. Neubig, “Webarena: A realistic web environment for building autonomous agents,” in The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. [Online]. Available: https://openreview.net/forum?id=oKn9c6ytLx

2024

[65] [65]

Visualwebarena: Evaluating multimodal agents on realistic visual web tasks,

J. Y. Koh, R. Lo, L. Jang, V. Duvvur, M. C. Lim, P. Huang, G. Neubig, S. Zhou, R. Salakhutdinov, and D. Fried, “Visualwebarena: Evaluating multimodal agents on realistic visual web tasks,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, L....

work page doi:10.18653/v1/2024.acl-long.50 2024

[66] [66]

Widesearch: Benchmarking agentic broad info-seeking,

R. Wong, J. Wang, J. Zhao, L. Chen, Y. Gao, L. Zhang, X. Zhou, Z. Wang, K. Xiang, G. Zhang, W. Huang, Y. Wang, and K. Wang, “Widesearch: Benchmarking agentic broad info-seeking,” CoRR, vol. abs/2508.07999, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2508.07999

work page doi:10.48550/arxiv.2508.07999 2025

[67] [67]

Infodeepseek: Benchmarking agentic information seeking for retrieval-augmented generation,

Y. Xi, J. Lin, M. Zhu, Y. Xiao, Z. Ou, J. Liu, T. Wan, B. Chen, W. Liu, Y. Wang, R. Tang, W. Zhang, and Y. Yu, “Infodeepseek: Benchmarking agentic information seeking for retrieval-augmented generation,” CoRR, vol. abs/2505.15872, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2505.15872

work page doi:10.48550/arxiv.2505.15872 2025

[68] [68]

Open data synthesis for deep research,

Z. Xia, K. Luo, H. Qian, and Z. Liu, “Open data synthesis for deep research,” CoRR, vol. abs/2509.00375, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2509.00375

work page doi:10.48550/arxiv.2509.00375 2025

[69] [69]

Mmdeepresearch-bench: A benchmark for multimodal deep research agents,

P. Huang, Z. Zhong, Z. Wan, D. Zhou, S. Alam, X. Wang, Z. Li, Z. Dou, L. Zhu, J. Xiong, C. Tao, Y. Xu, D. Dimitriadis, T. Zhang, and M. Zhang, “Mmdeepresearch-bench: A benchmark for multimodal deep research agents,” CoRR, vol. abs/2601.12346, 2026. [Online]. Available: https://doi.org/10.48550/arXiv.2601.12346

work page doi:10.48550/arxiv.2601.12346 2026

[71] [71]

Webwalker: Benchmarking llms in web traversal,

J. Wu, W. Yin, Y. Jiang, Z. Wang, Z. Xi, R. Fang, L. Zhang, Y. He, D. Zhou, P. Xie, and F. Huang, “Webwalker: Benchmarking llms in web traversal,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, W. Che, J. Nabende, E. Shutova, and M. T....

2025

[72] [72]

Dragged into conflicts: Detecting and addressing conflicting sources in search-augmented llms,

A. Cattan, A. Jacovi, O. Ram, J. Herzig, R. Aharoni, S. Goldshtein, E. Ofek, I. Szpektor, and A. Caciularu, “Dragged into conflicts: Detecting and addressing conflicting sources in search-augmented llms,” CoRR, vol. abs/2506.08500, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2506.08500

work page doi:10.48550/arxiv.2506.08500 2025

[73] [73]

BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

P. Zhou, B. Leon, X. Ying, C. Zhang, Y. Shao, Q. Ye, D. Chong, Z. Jin, C. Xie, M. Cao, Y. Gu, S. Hong, J. Ren, J. Chen, C. Liu, and Y. Hua, “Browsecomp-zh: Benchmarking web browsing ability of large language models in chinese,” CoRR, vol. abs/2504.19314, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2504.19314

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2504.19314 2025

[74] [74]

Characterizing deep research: A benchmark and formal definition,

A. Java, A. Khandelwal, S. P. Midigeshi, A. Halfaker, A. Deshpande, N. Goyal, A. Gupta, N. Natarajan, and A. Sharma, “Characterizing deep research: A benchmark and formal definition,” CoRR, vol. abs/2508.04183, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2508.04183

work page doi:10.48550/arxiv.2508.04183 2025

[75] [75]

Omnigaia: Towards native omni-modal AI agents,

X. Li, W. Jiao, J. Jin, S. Wang, G. Dong, J. Jin, H. Wang, Y. Wang, J. Wen, Y. Lu, and Z. Dou, “Omnigaia: Towards native omni-modal AI agents,” CoRR, vol. abs/2602.22897, 2026. [Online]. Available: https://doi.org/10.48550/arXiv.2602.22897

work page doi:10.48550/arxiv.2602.22897 2026

[76] [76]

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

M. Du, B. Xu, C. Zhu, X. Wang, and Z. Mao, “Deepresearch bench: A comprehensive benchmark for deep research agents,” CoRR, vol. abs/2506.11763, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2506.11763

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2506.11763 2025

[77] [77]

Multimodal deepresearcher: Generating text-chart interleaved reports from scratch with agentic framework,

Z. Yang, B. Pan, H. Wang, Y. Wang, X. Liu, M. Zhu, B. Zhang, and W. Chen, “Multimodal deepresearcher: Generating text-chart interleaved reports from scratch with agentic framework,” CoRR, vol. abs/2506.02454, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2506.02454

work page doi:10.48550/arxiv.2506.02454 2025

[78] [78]

Dr. bench: A multidimensional evaluation for deep research agents, from answers to reports,

Y. Yao, Y. Wang, Y. Zhang, Y. Lu, T. Gu, L. Li, D. Zhao, K. Wu, H. Wang, P. Nie, Y. Teng, and Y. Wang, “Dr. bench: A multidimensional evaluation for deep research agents, from answers to reports,” 2026. [Online]. Available: https://arxiv.org/abs/2510.02190

arXiv 2026

[79] [79]

Surveygen: Quality-aware scientific survey generation with large language models,

T. Bao, M. T. Nayeem, D. Rafiei, and C. Zhang, “Surveygen: Quality-aware scientific survey generation with large language models,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4-9, 2025, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, Eds. Association for Comput...

work page doi:10.18653/v1/2025.emnlp-main.136 2025

[80] [80]

Reportbench: Evaluating deep research agents via academic survey tasks,

M. Li, Y. Zeng, Z. Cheng, C. Ma, and K. Jia, “Reportbench: Evaluating deep research agents via academic survey tasks,” CoRR, vol. abs/2508.15804, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2508.15804

work page doi:10.48550/arxiv.2508.15804 2025