
arxiv: 2606.12191 · v1 · pith:FYDB6HCUnew · submitted 2026-06-10 · 💻 cs.CL · cs.AI

Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

Pith reviewed 2026-06-27 09:45 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: agentic environments, LLM agents, environment modeling, environment synthesis, environment evaluation, agent-environment co-evolution, symbolic synthesis, neural synthesis

The pith

Environments for LLM-based agents are engineered through modeling, synthesis, evaluation, and application to support continual capability evolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out a systematic view of agentic environments by organizing existing work around the full lifecycle of environment engineering. It details how environments are modeled along eight attributes and across eight domains, synthesized with symbolic or neural methods, evaluated with methods matched to each synthesis paradigm, and applied so that agents and environments evolve together. This matters because it names specific mechanisms, namely four agent evolution pathways and three environment evolution paradigms, that can be used to advance LLM agents in interactive settings. A sympathetic reader would see this as a map for building better interactive systems that push model limits.

Core claim

Environments serve as interactive systems for large language model based agents across diverse scenarios and play a crucial role in driving the continual evolution of model capabilities. The paper systematically studies current researches on agentic environments from the perspective of the environment engineering lifecycle, covering their modeling, synthesis, evaluation and application. It introduces representative environments from the perspectives of eight attributes and eight domains, two paradigms for automated environment synthesis, different environment evaluation methods, and from the agent-environment co-evolution perspective, four complementary perspectives for agent evolution and t

What carries the argument

The environment engineering lifecycle, which organizes the study of agentic environments into stages of modeling, synthesis, evaluation, and application, with co-evolution as the application focus.

If this is right

  • Representative environments are classified using eight attributes and eight domains.
  • Automated synthesis uses symbolic synthesis and neural synthesis paradigms.
  • Evaluation methods are tied to each synthesis paradigm.
  • Agent evolution follows memory-centric, orchestration-centric, trajectory-centric or exploration-centric paths.
  • Environment evolution is neural-driven, difficulty-driven or scaling-driven.
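As an illustration of the difficulty-driven paradigm listed above, the following is a minimal sketch, not the survey's method. It assumes a hypothetical `DifficultyCurriculum` class in which the environment raises or lowers an integer difficulty level from the agent's recent success rate; the window size and the promotion and demotion thresholds are illustrative assumptions, not values from the paper.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class DifficultyCurriculum:
    """Hypothetical curriculum: adapt task difficulty to recent agent success."""
    level: int = 1              # current difficulty level
    min_level: int = 1
    max_level: int = 10
    window: int = 20            # episodes per decision (assumed value)
    promote_at: float = 0.8     # success rate that triggers harder tasks (assumed)
    demote_at: float = 0.3      # success rate that triggers easier tasks (assumed)
    history: deque = field(default_factory=deque)

    def record(self, success: bool) -> int:
        """Log one episode outcome and return the (possibly updated) level."""
        self.history.append(success)
        if len(self.history) > self.window:
            self.history.popleft()
        if len(self.history) == self.window:
            rate = sum(self.history) / self.window
            if rate >= self.promote_at and self.level < self.max_level:
                self.level += 1
                self.history.clear()   # restart statistics at the new level
            elif rate <= self.demote_at and self.level > self.min_level:
                self.level -= 1
                self.history.clear()
        return self.level
```

In a training loop, `record` would be called once per episode and the returned level used to sample the next task. Clearing the history after each change keeps results from the old level from influencing the next decision.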

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Organizing environments this way may reveal opportunities to create hybrid environments that combine symbolic and neural synthesis for more robust testing.
  • The four agent evolution perspectives could be combined in single systems to test if they produce synergistic improvements.
  • Future environment design might prioritize scaling-driven approaches if they prove most effective at pushing model limits.
  • This survey implies that progress in LLM agents depends as much on environment quality as on model architecture.

Load-bearing premise

The representative environments, paradigms, and pathways selected from the literature provide a comprehensive and unbiased view of the field without significant omissions in coverage or categorization.

What would settle it

Identification of an important agentic environment or evolution method that does not align with the proposed eight attributes and domains, two synthesis paradigms, four agent evolution perspectives, or three environment evolution paradigms.

Figures

Figures reproduced from arXiv: 2606.12191 by Dongqi Huang, Hongbang Yuan, Jiachun Li, Jinshan Gao, Jun Zhao, Kang Liu, Kejian Zhu, Lingshuai Wang, Longxiang Wang, Lu Wang, Ruilin Xu, Shengjia Hua, Tianyi Men, Yupu Hao, Zhuoran Jin.

Figure 1: An overview of agentic environment engineering.
Figure 2: Taxonomy of the main content of this survey.
Figure 3: Taxonomy of the main content of this survey.
Figure 4: A comparison between data engineering and environment engi…
Figure 5: An overview of environment attributes. (Body text captured with this caption: "…and how the environment dynamics are generated. This determines whether the system relies on programmed code to update the states or uses a neural model to predict future states. 3.1.1 Symbolic Environment. In a symbolic environment, the transition dynamics are governed by explicit programmed logic and predefined rules. The transition function P(s_{t+1} | s_t, a_t) is realiz…")
Figure 6: An overview of environment domains, including GUI, Deep Research, Embodied, Game, Tool, Code, and Domain-Specific.
Figure 7: Three symbolic environment synthesis methods are presented: Task-Driven Synthesis, Real-World-Driven Synthesis, and De Novo…
Figure 8: Three neural environment synthesis paradigms are presented: Pixel-Level Modeling, Word-Level Modeling, and Latent-Level Modeling,…
Figure 9: Overview of agent evolution paradigms. Existing methods are organized into four categories: Memory-Centric Experience Evolution, which…
Figure 10: Overview of environment evolution paradigms. Existing methods are organized into three categories: Neural-Driven Evolution, which evolves environments through self-play or world models; Difficulty-Driven Evolution, which adapts task difficulty via explicit curriculum signals or implicit curriculum mechanisms; and Scaling-Driven Evolution, which expands environment diversity at the scenario or environment …
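
The text captured with Figure 5 describes a symbolic environment as one whose transition function P(s_{t+1} | s_t, a_t) is realized by explicit programmed logic rather than a neural predictor. A minimal sketch of that idea follows. The `GridWorld` class, its action names, and its reward rule are hypothetical illustrations, not taken from the paper.

```python
from typing import Dict, Tuple

State = Tuple[int, int]   # (row, col) position on a square grid
Action = str              # "up", "down", "left", or "right"

# Hand-written movement rules: this table is the "programmed logic".
MOVES: Dict[Action, Tuple[int, int]] = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
}


class GridWorld:
    """Hypothetical symbolic environment with rule-based, deterministic dynamics."""

    def __init__(self, size: int = 3, goal: State = (2, 2)) -> None:
        self.size = size
        self.goal = goal

    def transition(self, state: State, action: Action) -> State:
        """Return the unique next state, so P(s_{t+1} | s_t, a_t) is 1 for it."""
        dr, dc = MOVES[action]
        row = min(max(state[0] + dr, 0), self.size - 1)  # clamp to the grid
        col = min(max(state[1] + dc, 0), self.size - 1)
        return (row, col)

    def step(self, state: State, action: Action) -> Tuple[State, float, bool]:
        """Apply the rule, then report (next_state, reward, done)."""
        nxt = self.transition(state, action)
        done = nxt == self.goal
        return nxt, (1.0 if done else 0.0), done
```

A neural environment, in the caption's terms, would replace the body of `transition` with a learned model that predicts the next state, leaving the `step` interface unchanged.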
Original abstract

Environments serve as interactive systems for large language model (LLM) based agents across diverse scenarios and play a crucial role in driving the continual evolution of model capabilities. Despite this importance, existing work lacks a systematic categorization and deep analysis. This paper systematically studies current researches on agentic environments from the perspective of the environment engineering lifecycle, covering their modeling, synthesis, evaluation and application. Specifically, the paper first introduces representative environments from the perspectives of eight attributes and eight domains, providing detailed analyses of their development paths and highlighting their core capabilities. Second, for automated environment synthesis, two paradigms are introduced, such as symbolic synthesis and neural synthesis. This paper also shows different environment evaluation methods in each paradigm. Thirdly, the corresponding environment applications from the perspective of agent-environment co-evolution are discussed. In specific, the paper characterizes the primary pathways for agent evolution in dynamic environments from four complementary perspectives: memory-centric experience evolution, orchestration-centric workflow evolution, trajectory-centric offline evolution, and exploration-centric online evolution. And three paradigms of environment evolution are identified, namely neural-driven, difficulty-driven, and scaling-driven approaches. At last, several promising future directions are discussed, including Environment-as-a-Service, Multi-agent Environments, and Neural-Symbolic Environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims to deliver a systematic survey of agentic environments for LLM-based agents, organized around the environment engineering lifecycle. It first categorizes representative environments via eight attributes and eight domains, analyzing their development paths and core capabilities; second, it presents two automated synthesis paradigms (symbolic and neural) together with associated evaluation methods; third, it examines applications through agent-environment co-evolution, characterizing four complementary evolution pathways (memory-centric experience evolution, orchestration-centric workflow evolution, trajectory-centric offline evolution, exploration-centric online evolution) and three environment-evolution paradigms (neural-driven, difficulty-driven, scaling-driven); finally, it outlines future directions including Environment-as-a-Service, Multi-agent Environments, and Neural-Symbolic Environments.

Significance. If the selected representatives and categorizations are shown to be comprehensive, the survey would provide a useful organizing framework for an emerging area by foregrounding the co-evolution of agents and environments. The lifecycle perspective and explicit enumeration of pathways and paradigms constitute a concrete contribution that could guide subsequent research; the paper earns credit for attempting to synthesize diverse lines of work into a structured lifecycle view rather than a simple enumeration.

major comments (1)
  1. [Introduction / survey scope] The abstract asserts that the paper 'systematically studies current researches' and selects 'representative environments' across eight attributes, eight domains, two synthesis paradigms, four evolution pathways, and three environment-evolution paradigms, yet no description is given of the literature search strategy, databases, keywords, temporal scope, or inclusion/exclusion criteria. This is load-bearing for the central claim, because without such details it is impossible to verify whether the chosen categories constitute a comprehensive, unbiased view or a post-hoc organization that may omit major lines of work (e.g., certain embodied or multi-agent setups).

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the survey scope and methodology. We agree that explicit details on the literature search process are necessary to support the claim of a systematic survey and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Introduction / survey scope] The abstract asserts that the paper 'systematically studies current researches' and selects 'representative environments' across eight attributes, eight domains, two synthesis paradigms, four evolution pathways, and three environment-evolution paradigms, yet no description is given of the literature search strategy, databases, keywords, temporal scope, or inclusion/exclusion criteria. This is load-bearing for the central claim, because without such details it is impossible to verify whether the chosen categories constitute a comprehensive, unbiased view or a post-hoc organization that may omit major lines of work (e.g., certain embodied or multi-agent setups).

    Authors: We acknowledge the validity of this observation. The current manuscript does not include an explicit description of the search strategy. In the revised version we will insert a new subsection (tentatively titled 'Survey Methodology') immediately after the Introduction that specifies: (1) databases queried (arXiv, ACL Anthology, NeurIPS/ICLR/ICML proceedings, and Google Scholar), (2) primary keywords and Boolean combinations (e.g., "LLM agent" AND (environment OR simulator OR benchmark)), (3) temporal scope (primarily 2022–2024 with selected foundational works), and (4) inclusion/exclusion criteria (interactive environments supporting LLM-based agents; exclusion of purely static datasets or non-agentic simulators). We will also add a brief limitations paragraph noting that, while the eight domains and attributes aim to capture the dominant research threads, certain specialized embodied or multi-agent configurations may receive lighter coverage and are flagged for future expansion. These additions will allow readers to assess the representativeness of the selected categories.

    Revision: yes
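
The Boolean combination and temporal scope quoted in the simulated rebuttal can be sketched as a simple screening filter. This is an illustrative assumption about how such criteria might be applied, not the authors' actual procedure. The function names, the substring matching, and the `foundational` flag are all hypothetical.

```python
# Hypothetical screening filter for the query
#   "LLM agent" AND (environment OR simulator OR benchmark)
# with a primary temporal scope of 2022-2024.

ANY_OF = ("environment", "simulator", "benchmark")


def matches_query(text: str) -> bool:
    """Case-insensitive substring check of the Boolean keyword combination."""
    t = text.lower()
    return "llm agent" in t and any(term in t for term in ANY_OF)


def in_scope(title: str, abstract: str, year: int, foundational: bool = False) -> bool:
    """Keep a record if it matches the query and falls in 2022-2024.

    `foundational` models the "selected foundational works" exception;
    how such works are chosen is not specified in the source.
    """
    if not matches_query(f"{title} {abstract}"):
        return False
    return foundational or 2022 <= year <= 2024
```

Substring matching is deliberately crude: it treats "LLM agents" as a hit for "LLM agent" but would miss paraphrases such as "language model agents", which is one reason a real protocol would also need manual screening against the inclusion and exclusion criteria.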

Circularity Check

0 steps flagged

No circularity: survey organizes external literature


This is a literature survey paper with no mathematical derivations, equations, fitted parameters, predictions, or self-referential claims. The central content consists of categorizing and summarizing cited external works across modeling, synthesis, evaluation, and application. No load-bearing step reduces by construction to the paper's own inputs or self-citations. Selection of representatives is presented as a review of the field rather than a derived result, so concerns about coverage fall under completeness rather than circularity. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper that organizes existing literature on agentic environments for LLMs; it introduces no new free parameters, axioms, or invented entities beyond standard assumptions in AI literature review methodology.

pith-pipeline@v0.9.1-grok · 5805 in / 1170 out tokens · 14870 ms · 2026-06-27T09:45:02.026088+00:00 · methodology


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Empowering GUI Agents via Autonomous Experience Exploration and Hindsight Experience Utilization for Task Planning

    cs.CL 2026-06 unverdicted novelty 6.0

    PEEU enables a 7B MLLM to reach 30.6% accuracy on GUI task planning by autonomous exploration and hindsight experience synthesis, outperforming a 32B model through stronger high-level OOD generalization.

  2. Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It

    cs.CL 2026-06 unverdicted novelty 5.0

    RL for LLM multi-step tool use collapses from control token probability spikes but interleaving SFT improves stability at the cost of OOD generalization.

  3. Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do

    cs.CL 2026-06 unverdicted novelty 5.0

    Multimodal CoT improves math and science reasoning but degrades visual perception performance and exhibits a pattern of diminishing visual reflection while verbal reflection varies.

Reference graph

Works this paper leans on

299 extracted references · 128 canonical work pages · cited by 3 Pith papers · 29 internal anchors

  1. [1]

    Qwen3 technical report,

    Q. Team, “Qwen3 technical report,”CoRR, vol. abs/2505.09388,

  2. [2]

    BrainOmni: A Brain Foundation Model for Unified EEG and MEG Signals

    [Online]. Available: https://doi.org/10.48550/arXiv.2505. 09388

  3. [3]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,

    DeepSeek-AI, “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,”CoRR, vol. abs/2501.12948,

  4. [4]

    Available: https://doi.org/10.48550/arXiv.2501

    [Online]. Available: https://doi.org/10.48550/arXiv.2501. 12948

  5. [5]

    GEM: A gym for agentic llms,

    Z. Liu, A. Sims, K. Duan, C. Chen, S. Yu, X. Zhou, H. Xu, S. Xiong, B. Liu, C. Tan, C. Y. Beh, W. Wang, H. Zhu, W. Shi, D. Yang, M. Shieh, Y. W. Teh, W. S. Lee, and M. Lin, “GEM: A gym for agentic llms,”CoRR, vol. abs/2510.01051, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2510.01051

  6. [6]

    Humanity’s last code exam: Can advanced llms conquer human’s hardest code competition?

    X. Li, X. Li, K. Dong, Q. Zhang, R. Ruan, X. Dai, Y. Wang, and R. Tang, “Humanity’s last code exam: Can advanced llms conquer human’s hardest code competition?” inFindings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, November 4-9, 2025, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V . Peng, Eds. Association for Compu...

  7. [7]

    (2025) Introducing gpt-5.4

    OpenAI. (2025) Introducing gpt-5.4. Accessed: 2026- 03-16. [Online]. Available: https://openai.com/index/ introducing-gpt-5-4/

  8. [8]

    (2025) Gemini 3.1 pro

    Google DeepMind. (2025) Gemini 3.1 pro. Accessed: 2026-03-16. [Online]. Available: https://deepmind.google/models/gemini/ pro/

  9. [9]

    Kimi K2.5: visual agentic intelligence,

    K. Team, “Kimi K2.5: visual agentic intelligence,”CoRR, vol. abs/2602.02276, 2026. [Online]. Available: https://doi.org/10. 48550/arXiv.2602.02276

  10. [10]

    ToolRL: Reward is All Tool Learning Needs

    C. Qian, E. C. Acikgoz, Q. He, H. Wang, X. Chen, D. Hakkani- Tür, G. Tur, and H. Ji, “Toolrl: Reward is all tool learning needs,”CoRR, vol. abs/2504.13958, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2504.13958

  11. [11]

    Travelplanner: A benchmark for real-world planning with language agents,

    J. Xie, K. Zhang, J. Chen, T. Zhu, R. Lou, Y. Tian, Y. Xiao, and Y. Su, “Travelplanner: A benchmark for real-world planning with language agents,” inForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024, ser. Proceedings of Machine Learning Research, R. Salakhutdinov, Z. Kolter, K. A. Heller, A. Weller, N. ...

  12. [12]

    Self-refine: Iterative refinement with self- feedback,

    A. Madaan, N. Tandon, P . Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Gupta, B. P . Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, and P . Clark, “Self-refine: Iterative refinement with self- feedback,” inAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing...

  13. [13]

    Agent world model: Infinity synthetic environments for agentic reinforcement learning,

    Z. Wang, C. Xu, B. Liu, Y. Wang, S. Han, Z. Yao, H. Yao, and Y. He, “Agent world model: Infinity synthetic environments for agentic reinforcement learning,”arXiv preprint arXiv:2602.10090, 2026

  14. [14]

    Rwku: Benchmarking real-world knowledge unlearning for large language models,

    P . Cao, C. Wang, Z. He, H. Yuan, J. Li, Y. Chen, K. Liu, J. Zhao et al., “Rwku: Benchmarking real-world knowledge unlearning for large language models,” vol. 37, 2024, pp. 98 213–98 263

  15. [15]

    A trouble- maker with contagious jailbreak makes chaos in honest towns,

    T. Men, P . Cao, Z. Jin, Y. Chen, K. Liu, and J. Zhao, “A trouble- maker with contagious jailbreak makes chaos in honest towns,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 17 561– 17 587

  16. [16]

    Dacomp: Benchmarking data agents across the full data intelligence lifecycle,

    F. Lei, J. Meng, Y. Huang, J. Zhao, Y. Zhang, J. Luo, X. Zou, R. Yang, W. Shi, Y. Gaoet al., “Dacomp: Benchmarking data agents across the full data intelligence lifecycle,”arXiv preprint arXiv:2512.04324, 2025

  17. [17]

    Webshop: Towards scalable real-world web interaction with grounded language agents,

    S. Yao, H. Chen, J. Yang, and K. Narasimhan, “Webshop: Towards scalable real-world web interaction with grounded language agents,” inAdvances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, S. Koyejo, S. Mohamed, A. Agarwal, D....

  18. [18]

    Swe-bench: Can language models resolve real-world github issues?

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan, “Swe-bench: Can language models resolve real-world github issues?” inThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. [Online]. Available: https://openreview.net/forum?id=VTF8yNQM66

  19. [19]

    React: Synergizing reasoning and acting in language models,

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” inThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. [Online]. Available: https: //openreview.net/forum?id=WE_vluYUL-X

  20. [20]

    Fixing the broken compass: Diagnosing and improving inference-time reward modeling,

    J. Li, P . Cao, Y. Chen, J. Xu, H. Li, X. Jiang, K. Liu, and J. Zhao, “Fixing the broken compass: Diagnosing and improving inference-time reward modeling,” 2025

  21. [21]

    Omni-reward: Towards generalist omni-modal reward model- ing with free-form preferences,

    Z. Jin, H. Yuan, K. Zhu, J. Li, P . Cao, Y. Chen, K. Liu, and J. Zhao, “Omni-reward: Towards generalist omni-modal reward model- ing with free-form preferences,”arXiv preprint arXiv:2510.23451, 2025

  22. [22]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Z. Shao, P . Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. K. Li, Y. Wu, and D. Guo, “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,”CoRR, vol. abs/2402.03300, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2402.03300

  23. [23]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, T. Fan, G. Liu, L. Liu, X. Liu, H. Lin, Z. Lin, B. Ma, G. Sheng, Y. Tong, C. Zhang, M. Zhang, W. Zhang, H. Zhu, J. Zhu, J. Chen, J. Chen, C. Wang, H. Yu, W. Dai, Y. Song, X. Wei, H. Zhou, J. Liu, W. Ma, Y. Zhang, L. Yan, M. Qiao, Y. Wu, and M. Wang, “DAPO: an open-source LLM reinforcement learning system a...

  24. [24]

    Reinforced internal-external knowledge synergistic reasoning for efficient adaptive search agent,

    Z. Huang, X. Yuan, Y. Ju, J. Zhao, and K. Liu, “Reinforced internal-external knowledge synergistic reasoning for efficient adaptive search agent,”arXiv preprint arXiv:2505.07596, 2025

  25. [25]

    Towards agentic self-learning llms in search environ- ment,

    W. Sun, X. Cheng, J. Fan, Y. Xu, X. Yu, S. He, J. Zhao, and K. Liu, “Towards agentic self-learning llms in search environ- ment,”arXiv preprint arXiv:2510.14253, 2025

  26. [26]

    Agentic Reasoning for Large Language Models

    T. Wei, T. Li, Z. Liu, X. Ning, Z. Yang, J. Zou, Z. Zeng, R. Qiu, X. Lin, D. Fu, Z. Li, M. Ai, D. Zhou, W. Bao, Y. Li, G. Li, C. Qian, Y. Wang, X. Tang, Y. Xiao, L. Fang, H. Liu, X. Tang, Y. Zhang, C. Wang, J. You, H. Ji, H. Tong, and J. He, “Agentic reasoning for large language models,”CoRR, vol. abs/2601.12538, 2026. [Online]. Available: https://doi.org...

  27. [27]

    Large language models for plan- ning: A comprehensive and systematic survey,

    P . Cao, T. Men, W. Liu, J. Zhang, X. Li, X. Lin, D. Sui, Y. Cao, K. Liu, and J. Zhao, “Large language models for plan- ning: A comprehensive and systematic survey,”arXiv preprint arXiv:2505.19683, 2025

  28. [28]

    A survey of recent advances in commonsense knowledge acquisition: Methods and resources,

    C. Wang, J. Li, Y. Chen, K. Liu, and J. Zhao, “A survey of recent advances in commonsense knowledge acquisition: Methods and resources,”Machine Intelligence Research, vol. 22, no. 2, pp. 201– 218, 2025

  29. [29]

    Workarena: How capable are web agents at solving common knowledge work tasks?

    A. Drouin, M. Gasse, M. Caccia, I. H. Laradji, M. D. Verme, T. Marty, D. Vázquez, N. Chapados, and A. Lacoste, “Workarena: How capable are web agents at solving common knowledge work tasks?” inForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21- 27, 2024, ser. Proceedings of Machine Learning Research, R. Salakhutd...

  30. [30]

    Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments,

    T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, Y. Liu, Y. Xu, S. Zhou, S. Savarese, C. Xiong, V . Zhong, and T. Yu, “Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments,” inAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processin...

  31. [31]

    Measuring short-form factuality in large language models,

    J. Wei, N. Karina, H. W. Chung, Y. J. Jiao, S. Papay, A. Glaese, J. Schulman, and W. Fedus, “Measuring short-form factuality in large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2411.04368

  32. [32]

    GAIA: a benchmark for general AI assistants,

    G. Mialon, C. Fourrier, T. Wolf, Y. LeCun, and T. Scialom, “GAIA: a benchmark for general AI assistants,” inThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. [Online]. Available: https://openreview.net/forum?id=fibxvahvs3

  33. [33]

    BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

    J. Wei, Z. Sun, S. Papay, S. McKinney, J. Han, I. Fulford, H. W. Chung, A. T. Passos, W. Fedus, and A. Glaese, “Browsecomp: A simple yet challenging benchmark for browsing agents,”CoRR, vol. abs/2504.12516, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2504.12516

  34. [34]

    ALFRED: A benchmark for interpreting grounded instructions for everyday tasks,

    M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox, “ALFRED: A benchmark for interpreting grounded instructions for everyday tasks,” in2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. Computer Vision Foundation / IEEE, 2020, pp. 10 737–10 746. [On...

  35. [35]

    Alfworld: Aligning text and embodied environments for interactive learning,

    M. Shridhar, X. Yuan, M. Côté, Y. Bisk, A. Trischler, and M. J. Hausknecht, “Alfworld: Aligning text and embodied environments for interactive learning,” in9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. [Online]. Available: https://openreview.net/forum?id=0IOX0YcCdTn

  36. [36]

    Scienceworld: Is your agent smarter than a 5th grader?

    R. Wang, P . A. Jansen, M. Côté, and P . Ammanabrolu, “Scienceworld: Is your agent smarter than a 5th grader?” inProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds. Association for Computational Linguistics, 20...

  37. [37]

    Gamearena: Evaluating llm reasoning through live computer games,

    L. Hu, Q. Li, A. Xie, N. Jiang, I. Stoica, H. Jin, and H. Zhang, “Gamearena: Evaluating llm reasoning through live computer games,” 2025. [Online]. Available: https: //arxiv.org/abs/2412.06394

  38. [38]

    Baba is ai: Break the rules to beat the benchmark,

    N. Cloos, M. Jens, M. Naim, Y.-L. Kuo, I. Cases, A. Barbu, and C. J. Cueva, “Baba is ai: Break the rules to beat the benchmark,”

  39. [39]

    Available: https://arxiv.org/abs/2407.13729

    [Online]. Available: https://arxiv.org/abs/2407.13729

  40. [40]

    Gamebench: Evaluating strategic reasoning abilities of llm agents,

    A. Costarelli, M. Allen, R. Hauksson, G. Sodunke, S. Hariharan, C. Cheng, W. Li, J. Clymer, and A. Yadav, “Gamebench: Evaluating strategic reasoning abilities of llm agents,” 2024. [Online]. Available: https://arxiv.org/abs/2406.06613

  41. [41]

    Toolllm: Facilitating large language models to master 16000+ real-world apis,

    Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, L. Hong, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun, “Toolllm: Facilitating large language models to master 16000+ real-world apis,” inThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7- 11, 202...

  42. [42]

    $\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

    S. Yao, N. Shinn, P . Razavi, and K. Narasimhan, “τ-bench: A benchmark for tool-agent-user interaction in real-world domains,”CoRR, vol. abs/2406.12045, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2406.12045

  43. [43]

    Api-bank: A comprehensive benchmark for tool-augmented llms,

    M. Li, Y. Zhao, B. Yu, F. Song, H. Li, H. Yu, Z. Li, F. Huang, and Y. Li, “Api-bank: A comprehensive benchmark for tool-augmented llms,” inProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, H. Bouamor, J. Pino, and K. Bali, Eds. Association for Computational Linguistics, 2023...

  44. [44]

    Program synthesis with large language models,

    J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Do- han, E. Jiang, C. Cai, M. Terry, Q. Leet al., “Program synthesis with large language models,”arXiv preprint arXiv:2108.07732, 2021

  45. [45]

    Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

    M. A. Merrill, A. G. Shaw, N. Carlini, B. Li, H. Raj, I. Bercovich, L. Shi, J. Y. Shin, T. Walshe, E. K. Buchanan, J. Shen, G. Ye, H. Lin, J. Poulos, M. Wang, M. Nezhurina, J. Jitsev, D. Lu, O. Menis-Mastromichalakis, Z. Xu, Z. Chen, Y. Liu, R. Zhang, L. L. Chen, A. Kashyap, J. Uslu, J. Li, J. Wu, M. Yan, S. Bian, V . Sharma, K. Sun, S. Dillmann, A. Anand...

  46. [46]

    Kernelbench: Can llms write efficient gpu kernels?

    A. Ouyang, S. Guo, S. Arora, A. L. Zhang, W. Hu, C. Ré, and A. Mirhoseini, “Kernelbench: Can llms write efficient gpu kernels?”arXiv preprint arXiv:2502.10517, 2025

  47. [47]

    Medagentbench: Dataset for benchmarking llms as agents in medical applications,

    Y. Jiang, K. C. Black, G. Geng, D. Park, A. Y. Ng, and J. H. Chen, “Medagentbench: Dataset for benchmarking llms as agents in medical applications,”CoRR, vol. abs/2501.14654, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2501.14654

  48. [48]

    Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery,

    Z. Chen, S. Chen, Y. Ning, Q. Zhang, B. Wang, B. Yu, Y. Li, Z. Liao, C. Wei, Z. Lu, V . Dey, M. Xue, F. N. Baker, B. Burns, D. Adu- Ampratwum, X. Huang, X. Ning, S. Gao, Y. Su, and H. Sun, “Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery,” inThe Thirteenth International Conference on Learning Represent...

  49. [49]

    Dsbench: How far are data science agents from becoming data science experts?

    L. Jing, Z. Huang, X. Wang, W. Yao, W. Yu, K. Ma, H. Zhang, X. Du, and D. Yu, “Dsbench: How far are data science agents from becoming data science experts?” inThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. [Online]. Available: https://openreview.net/forum?id=DSsSPr0RZJ

  50. [50]

    Openai gym,

    G. Brockman, V . Cheung, L. Pettersson, J. Schneider, J. Schul- man, J. Tang, and W. Zaremba, “Openai gym,”arXiv preprint arXiv:1606.01540, 2016

  51. [51]

    Agentbench: Evaluating llms as agents,

    X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, S. Zhang, X. Deng, A. Zeng, Z. Du, C. Zhang, S. Shen, T. Zhang, Y. Su, H. Sun, M. Huang, Y. Dong, and J. Tang, “Agentbench: Evaluating llms as agents,” in The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.n...

  52. [52]

    Agentscourt: Building judicial decision-making agents with court debate simulation and legal knowledge augmentation,

    Z. He, P. Cao, C. Wang, Z. Jin, Y. Chen, J. Xu, H. Li, K. Liu, and J. Zhao, “Agentscourt: Building judicial decision-making agents with court debate simulation and legal knowledge augmentation,” in Findings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 9399–9416

  53. [53]

    Mmr-v: What’s left unsaid? a benchmark for multi-modal deep reasoning in videos,

    K. Zhu, Z. Jin, H. Yuan, J. Li, S. Tu, P. Cao, Y. Chen, K. Liu, and J. Zhao, “Mmr-v: What’s left unsaid? a benchmark for multi-modal deep reasoning in videos,” arXiv preprint arXiv:2506.04141, 2025

  54. [54]

    Mmr-life: Piecing together real-life scenes for multi-modal multi-image reasoning,

    J. Li, S. Huang, Z. Jin, C. Zhang, P. Cao, Y. Chen, K. Liu, and J. Zhao, “Mmr-life: Piecing together real-life scenes for multi-modal multi-image reasoning,” arXiv preprint arXiv:2603.02024, 2026

  55. [55]

    Windows agent arena: Evaluating multi-modal os agents at scale,

    R. Bonatti, D. Zhao, F. Bonacci, D. Dupont, S. Abdali, Y. Li, Y. Lu, J. Wagle, K. Koishida, A. Bucker et al., “Windows agent arena: Evaluating multi-modal os agents at scale,” arXiv preprint arXiv:2409.08264, 2024.

  56. [56]

    Osworld-mcp: Benchmarking MCP tool invocation in computer-use agents,

    H. Jia, J. Liao, X. Zhang, H. Xu, T. Xie, C. Jiang, M. Yan, S. Liu, W. Ye, and F. Huang, “Osworld-mcp: Benchmarking MCP tool invocation in computer-use agents,” CoRR, vol. abs/2510.24563, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2510.24563

  57. [57]

    Mobile-env: Building qualified evaluation benchmarks for llm-gui interaction,

    D. Zhang, Z. Shen, R. Xie, S. Zhang, T. Xie, Z. Zhao, S. Chen, L. Chen, H. Xu, R. Cao et al., “Mobile-env: Building qualified evaluation benchmarks for llm-gui interaction,” arXiv preprint arXiv:2305.08144, 2023

  58. [58]

    Android in the wild: A large-scale dataset for android device control,

    C. Rawles, A. Li, D. Rodriguez, O. Riva, and T. P. Lillicrap, “Android in the wild: A large-scale dataset for android device control,” CoRR, vol. abs/2307.10088, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2307.10088

  59. [59]

    Androidworld: A dynamic benchmarking environment for autonomous agents,

    C. Rawles, S. Clinckemaillie, Y. Chang, J. Waltz, G. Lau, M. Fair, A. Li, W. E. Bishop, W. Li, F. Campbell-Ajala, D. K. Toyama, R. J. Berry, D. Tyamagundlu, T. P. Lillicrap, and O. Riva, “Androidworld: A dynamic benchmarking environment for autonomous agents,” in The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, Ap...

  60. [60]

    Mobileworld: Benchmarking autonomous mobile agents in agent-user interactive and mcp-augmented environments,

    Q. Kong, X. Zhang, Z. Yang, N. Gao, C. Liu, P. Tong, C. Cai, H. Zhou, J. Zhang, L. Chen, Z. Liu, S. Hoi, and Y. Wang, “Mobileworld: Benchmarking autonomous mobile agents in agent-user interactive and mcp-augmented environments,” CoRR, vol. abs/2512.19432, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2512.19432

  61. [61]

    Mobile-bench: An evaluation benchmark for llm-based mobile agents,

    S. Deng, W. Xu, H. Sun, W. Liu, T. Tan, L. Liujianfeng, A. Li, J. Luan, B. Wang, R. Yan et al., “Mobile-bench: An evaluation benchmark for llm-based mobile agents,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 8813–8831

  62. [62]

    Mind2web: Towards a generalist agent for the web,

    X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su, “Mind2web: Towards a generalist agent for the web,” in Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Ha...

  63. [63]

    Webarena: A realistic web environment for building autonomous agents,

    S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried, U. Alon, and G. Neubig, “Webarena: A realistic web environment for building autonomous agents,” in The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. [Online]. Available: https://openreview.net/forum?id=oKn9c6ytLx

  65. [65]

    Visualwebarena: Evaluating multimodal agents on realistic visual web tasks,

    J. Y. Koh, R. Lo, L. Jang, V. Duvvur, M. C. Lim, P. Huang, G. Neubig, S. Zhou, R. Salakhutdinov, and D. Fried, “Visualwebarena: Evaluating multimodal agents on realistic visual web tasks,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, L....

  66. [66]

    Widesearch: Benchmarking agentic broad info-seeking,

    R. Wong, J. Wang, J. Zhao, L. Chen, Y. Gao, L. Zhang, X. Zhou, Z. Wang, K. Xiang, G. Zhang, W. Huang, Y. Wang, and K. Wang, “Widesearch: Benchmarking agentic broad info-seeking,” CoRR, vol. abs/2508.07999, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2508.07999

  67. [67]

    Infodeepseek: Benchmarking agentic information seeking for retrieval-augmented generation,

    Y. Xi, J. Lin, M. Zhu, Y. Xiao, Z. Ou, J. Liu, T. Wan, B. Chen, W. Liu, Y. Wang, R. Tang, W. Zhang, and Y. Yu, “Infodeepseek: Benchmarking agentic information seeking for retrieval-augmented generation,” CoRR, vol. abs/2505.15872, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2505.15872

  68. [68]

    Open data synthesis for deep research,

    Z. Xia, K. Luo, H. Qian, and Z. Liu, “Open data synthesis for deep research,” CoRR, vol. abs/2509.00375, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2509.00375

  69. [69]

    Mmdeepresearch-bench: A benchmark for multimodal deep research agents,

    P. Huang, Z. Zhong, Z. Wan, D. Zhou, S. Alam, X. Wang, Z. Li, Z. Dou, L. Zhu, J. Xiong, C. Tao, Y. Xu, D. Dimitriadis, T. Zhang, and M. Zhang, “Mmdeepresearch-bench: A benchmark for multimodal deep research agents,” CoRR, vol. abs/2601.12346,

  70. [70]
  71. [71]

    Webwalker: Benchmarking llms in web traversal,

    J. Wu, W. Yin, Y. Jiang, Z. Wang, Z. Xi, R. Fang, L. Zhang, Y. He, D. Zhou, P. Xie, and F. Huang, “Webwalker: Benchmarking llms in web traversal,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, W. Che, J. Nabende, E. Shutova, and M. T....

  72. [72]

    Dragged into conflicts: Detecting and addressing conflicting sources in search-augmented llms,

    A. Cattan, A. Jacovi, O. Ram, J. Herzig, R. Aharoni, S. Goldshtein, E. Ofek, I. Szpektor, and A. Caciularu, “Dragged into conflicts: Detecting and addressing conflicting sources in search-augmented llms,” CoRR, vol. abs/2506.08500, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2506.08500

  73. [73]

    BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

    P. Zhou, B. Leon, X. Ying, C. Zhang, Y. Shao, Q. Ye, D. Chong, Z. Jin, C. Xie, M. Cao, Y. Gu, S. Hong, J. Ren, J. Chen, C. Liu, and Y. Hua, “Browsecomp-zh: Benchmarking web browsing ability of large language models in chinese,” CoRR, vol. abs/2504.19314, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2504.19314

  74. [74]

    Characterizing deep research: A benchmark and formal definition,

    A. Java, A. Khandelwal, S. P. Midigeshi, A. Halfaker, A. Deshpande, N. Goyal, A. Gupta, N. Natarajan, and A. Sharma, “Characterizing deep research: A benchmark and formal definition,” CoRR, vol. abs/2508.04183, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2508.04183

  75. [75]

    Omnigaia: Towards native omni-modal AI agents,

    X. Li, W. Jiao, J. Jin, S. Wang, G. Dong, J. Jin, H. Wang, Y. Wang, J. Wen, Y. Lu, and Z. Dou, “Omnigaia: Towards native omni-modal AI agents,” CoRR, vol. abs/2602.22897, 2026. [Online]. Available: https://doi.org/10.48550/arXiv.2602.22897

  76. [76]

    DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

    M. Du, B. Xu, C. Zhu, X. Wang, and Z. Mao, “Deepresearch bench: A comprehensive benchmark for deep research agents,” CoRR, vol. abs/2506.11763, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2506.11763

  77. [77]

    Multimodal deepresearcher: Generating text-chart interleaved reports from scratch with agentic framework,

    Z. Yang, B. Pan, H. Wang, Y. Wang, X. Liu, M. Zhu, B. Zhang, and W. Chen, “Multimodal deepresearcher: Generating text-chart interleaved reports from scratch with agentic framework,” CoRR, vol. abs/2506.02454, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2506.02454

  78. [78]

    Dr. bench: A multidimensional evaluation for deep research agents, from answers to reports,

    Y. Yao, Y. Wang, Y. Zhang, Y. Lu, T. Gu, L. Li, D. Zhao, K. Wu, H. Wang, P. Nie, Y. Teng, and Y. Wang, “Dr. bench: A multidimensional evaluation for deep research agents, from answers to reports,” 2026. [Online]. Available: https://arxiv.org/abs/2510.02190

  79. [79]

    Surveygen: Quality-aware scientific survey generation with large language models,

    T. Bao, M. T. Nayeem, D. Rafiei, and C. Zhang, “Surveygen: Quality-aware scientific survey generation with large language models,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4-9, 2025, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, Eds. Association for Comput...

  80. [80]

    Reportbench: Evaluating deep research agents via academic survey tasks,

    M. Li, Y. Zeng, Z. Cheng, C. Ma, and K. Jia, “Reportbench: Evaluating deep research agents via academic survey tasks,” CoRR, vol. abs/2508.15804, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2508.15804
