Recognition: unknown
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
Pith reviewed 2026-05-10 05:20 UTC · model grok-4.3
The pith
Agent-World trains general agents by synthesizing scalable real-world environments and self-evolving them to close capability gaps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Agent-World enables the co-evolution of agent policies and environments through two integrated components: Agentic Environment-Task Discovery, which autonomously synthesizes verifiable tasks with controllable difficulty from topic-aligned databases and executable tool ecosystems, and Continuous Self-Evolving Agent Training, which combines multi-environment reinforcement learning with a self-evolving arena that automatically identifies capability gaps via dynamic task synthesis. Evaluations show that the resulting Agent-World-8B and 14B models consistently outperform proprietary models and environment scaling baselines on 23 benchmarks, with performance scaling according to environment iverse
What carries the argument
The self-evolving agent arena that uses dynamic task synthesis to identify capability gaps and drive targeted learning in combination with multi-environment reinforcement learning.
If this is right
- Agent performance scales positively with increasing environment diversity and additional self-evolution rounds.
- Agents develop the ability to handle stateful, tool-using interactions in real-world services more effectively.
- Life-long learning in agents becomes possible through ongoing, automatic identification of gaps and targeted training.
- General agent intelligence can advance via the co-evolution of policies and environments rather than fixed training sets.
Where Pith is reading between the lines
- This suggests that autonomous synthesis of verifiable tasks could reduce reliance on manually curated datasets for agent training.
- Similar mechanisms might apply to evolving agents in other domains such as simulated physical environments or multi-agent systems.
- A testable extension would involve measuring how well the synthesized tasks generalize to entirely new tool ecosystems not used in training.
Load-bearing premise
The autonomously synthesized tasks are realistic, verifiable, and representative of genuine real-world challenges without introducing artifacts or causing overfitting.
What would settle it
Evaluating the trained agents on a large set of real-world agent tasks that were not part of the synthesis process and finding no performance advantage over baselines would falsify the effectiveness claim.
read the original abstract
Large language models are increasingly expected to serve as general-purpose agents that interact with external, stateful tool environments. The Model Context Protocol (MCP) and broader agent skills offer a unified interface for connecting agents with scalable real-world services, but training robust agents remains limited by the lack of realistic environments and principled mechanisms for life-long learning. In this paper, we present \textbf{Agent-World}, a self-evolving training arena for advancing general agent intelligence through scalable environments. Agent-World has two main components: (1) Agentic Environment-Task Discovery, which autonomously explores topic-aligned databases and executable tool ecosystems from thousands of real-world environment themes and synthesizes verifiable tasks with controllable difficulty; and (2) Continuous Self-Evolving Agent Training, which combines multi-environment reinforcement learning with a self-evolving agent arena that automatically identifies capability gaps through dynamic task synthesis and drives targeted learning, enabling the co-evolution of agent policies and environments. Across 23 challenging agent benchmarks, Agent-World-8B and 14B consistently outperforms strong proprietary models and environment scaling baselines. Further analyses reveal scaling trends in relation to environment diversity and self-evolution rounds, offering insights for building general agent intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Agent-World, a self-evolving training arena for general agent intelligence in LLMs. It consists of two components: (1) Agentic Environment-Task Discovery, which autonomously explores real-world environment themes to synthesize verifiable tasks with controllable difficulty from topic-aligned databases and tool ecosystems; and (2) Continuous Self-Evolving Agent Training, which combines multi-environment reinforcement learning with a self-evolving arena that dynamically identifies capability gaps via task synthesis to enable co-evolution of agent policies and environments. The central empirical claim is that the resulting Agent-World-8B and 14B models consistently outperform strong proprietary models and environment scaling baselines across 23 challenging agent benchmarks, supported by analyses of scaling trends with environment diversity and self-evolution rounds.
Significance. If the performance gains are robustly demonstrated with independent benchmarks and the synthesized tasks prove realistic and free of artifacts, this framework could meaningfully advance scalable training for general-purpose agents by addressing the scarcity of realistic, stateful environments and providing a mechanism for lifelong, gap-driven learning. The emphasis on controllable difficulty and dynamic synthesis offers a promising direction beyond static benchmarks.
major comments (2)
- [Abstract] Abstract: the headline claim that Agent-World-8B and 14B 'consistently outperforms strong proprietary models and environment scaling baselines' on 23 benchmarks is load-bearing for the paper's contribution, yet the abstract (and visible description) supplies no experimental details on baseline definitions, statistical tests, variance across runs, or confirmation that the 23 evaluation benchmarks are independent of the synthesis distribution.
- [Method description] Agentic Environment-Task Discovery and Continuous Self-Evolving Agent Training sections: the assertion that autonomously synthesized tasks are 'verifiable' and that the self-evolving loop 'automatically identifies capability gaps' without introducing artifacts or overfitting is central to the general-intelligence claim, but no concrete verification protocol, distribution-matching metrics, or ablation on gap-identification accuracy is described.
minor comments (2)
- The paper would benefit from a dedicated experiments section with tables reporting per-benchmark scores, baseline names, and statistical significance.
- Clarify how 'real-world themes' are sampled and how tool ecosystems are ensured to be executable without leakage into the 23 evaluation benchmarks.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments highlight important aspects of clarity in the abstract and methodological rigor. We address each point below and will make targeted revisions to strengthen the presentation without altering the core claims or results.
read point-by-point responses
-
Referee: [Abstract] Abstract: the headline claim that Agent-World-8B and 14B 'consistently outperforms strong proprietary models and environment scaling baselines' on 23 benchmarks is load-bearing for the paper's contribution, yet the abstract (and visible description) supplies no experimental details on baseline definitions, statistical tests, variance across runs, or confirmation that the 23 evaluation benchmarks are independent of the synthesis distribution.
Authors: We agree that the abstract is concise and could better contextualize the headline claim for readers. The full manuscript defines the baselines explicitly (proprietary models including GPT-4o and Claude-3.5-Sonnet plus environment scaling baselines such as uniform sampling and static dataset training), reports results with standard deviations across five independent runs per model in the main results table and Appendix, and confirms the 23 benchmarks are established, pre-existing agent evaluation suites (e.g., WebArena, ToolBench, OSWorld) with no overlap to the synthesized training distribution. To improve self-containment, we will revise the abstract to include a brief clause summarizing the evaluation protocol and independence of the benchmarks. revision: yes
-
Referee: [Method description] Agentic Environment-Task Discovery and Continuous Self-Evolving Agent Training sections: the assertion that autonomously synthesized tasks are 'verifiable' and that the self-evolving loop 'automatically identifies capability gaps' without introducing artifacts or overfitting is central to the general-intelligence claim, but no concrete verification protocol, distribution-matching metrics, or ablation on gap-identification accuracy is described.
Authors: The manuscript outlines task verifiability through successful execution against the real tool ecosystems and consistency checks with the source databases, and describes gap identification via performance-based triggering on newly generated tasks within the self-evolution loop. However, we acknowledge that more explicit protocols, such as formal distribution-matching metrics between synthesized and real-world task distributions and dedicated ablations measuring gap-identification precision, would address potential concerns about artifacts. We will add these elements, including a verification pseudocode snippet and expanded ablation results, in the revised manuscript. revision: yes
Circularity Check
No significant circularity in claimed derivation
full rationale
The paper describes a methodological framework consisting of Agentic Environment-Task Discovery (autonomous synthesis of verifiable tasks from real-world themes) and Continuous Self-Evolving Agent Training (multi-environment RL with dynamic gap identification). Performance is reported empirically on 23 independent benchmarks, with scaling trends noted in relation to environment diversity and evolution rounds. No equations, fitted parameters, or self-referential derivations are present in the abstract or described components that reduce any claim to its own inputs by construction. The self-evolving loop is a procedural mechanism for task generation and training, not a mathematical identity or fitted prediction that collapses to the input data. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 3 Pith papers
-
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.
-
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
-
Learning Agentic Policy from Action Guidance
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
Reference graph
Works this paper leans on
-
[1]
Aime2024, 2024
AIME2024. Aime2024, 2024. URLhttps://huggingface.co/datasets/HuggingFaceH4/aime_2024
2024
-
[2]
Aime2025, 2025
AIME2025. Aime2025, 2025. URLhttps://huggingface.co/datasets/opencompass/AIME2025
2025
-
[3]
Y.Gan,C.Li,J.Xie,L.Wen,M.Purver,andM.Poesio
Pierre Andrews, Amine Benhalloum, Gerard Moreno-Torres Bertran, Matteo Bettini, Amar Budhiraja, Ri- cardo Silveira Cabral, Virginie Do, Romain Froger, Emilien Garreau, Jean-Baptiste Gaya, et al. Are: Scaling up agent environments and evaluations.arXiv preprint arXiv:2509.17158, 2025
-
[4]
Introducing claude sonnet 4.5.https://www.anthropic.com/news/claude-sonnet-4-5, 2025
Anthropic. Introducing claude sonnet 4.5.https://www.anthropic.com/news/claude-sonnet-4-5, 2025
2025
-
[5]
WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks
Hao Bai, Alexey Taymanov, Tong Zhang, Aviral Kumar, and Spencer Whitehead. Webgym: Scaling training environments for visual web agents with realistic tasks, 2026. URLhttps://arxiv.org/abs/2601.02439
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[6]
MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers
Chaithanya Bandi, Ben Hertzberg, Geobio Boo, Tejas Polakam, Jeff Da, Sami Hassaan, Manasi Sharma, Andrew Park, Ernesto Hernandez, Dan Rambado, Ivan Salazar, Rafael Cruz, Chetan Rane, Ben Levin, Brad Kenstler, and Bing Liu. Mcp-atlas: A large-scale benchmark for tool-use competency with real mcp servers, 2026. URL https://arxiv.org/abs/2602.00933
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[7]
$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment
Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan.𝜏2-bench: Evaluating conversa- tional agents in a dual-control environment, 2025. URLhttps://arxiv.org/abs/2506.07982
work page internal anchor Pith review arXiv 2025
-
[8]
Seed2.0 model card: Towards intelligence frontier for real-world complexity
ByteDance Seed. Seed2.0 model card: Towards intelligence frontier for real-world complexity. Techni- cal report, ByteDance, 2025. URL https://seed.bytedance.com/en/seed2. Model card PDF: https://lf3- static.bytednsdoc.com/obj/eden-cn/lapzild-tss/ljhwZthlaukjlkulzlp/seed2/0214/Seed2.0
2025
-
[9]
Shihao Cai, Runnan Fang, Jialong Wu, Baixuan Li, Xinyu Wang, Yong Jiang, Liangcai Su, Liwen Zhang, Wenbiao Yin, Zhen Zhang, Fuli Feng, Pengjun Xie, and Xiaobin Wang. Autoforge: Automated environment synthesis for agentic reinforcement learning, 2025. URLhttps://arxiv.org/abs/2512.22857
-
[10]
A survey of pomdp applications
Anthony R Cassandra. A survey of pomdp applications. InWorking notes of AAAI 1998 fall symposium on planning with partially observable Markov decision processes, volume 1724, 1998
1998
-
[11]
Dive: Scaling diversity in agentic task synthesis for generalizable tool use, 2026
Aili Chen, Chi Zhang, Junteng Liu, Jiangjie Chen, Chengyu Du, Yunji Li, Ming Zhong, Qin Wang, Zhengmao Zhu, Jiayuan Song, Ke Ji, Junxian He, Pengyu Zhao, and Yanghua Xiao. Dive: Scaling diversity in agentic task synthesis for generalizable tool use, 2026. URLhttps://arxiv.org/abs/2603.11076
-
[12]
ACEBench: A comprehensive evaluation of LLM tool usage
Chen Chen, Xinlong Hao, Weiwen Liu, Xu Huang, Xingshan Zeng, Shuai Yu, Dexun Li, Yuefeng Huang, Xiangcheng Liu, Wang Xinzhi, and Wu Liu. ACEBench: A comprehensive evaluation of LLM tool usage. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings of the Association for Computational Linguistics: EMNLP 2025, pa...
-
[13]
Learning to reason with search for llms via reinforcement learning,
Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z. Pan, Wen Zhang, Huajun Chen, Fan Yang, Zenan Zhou, and Weipeng Chen. Research: Learning to reason with search for llms via reinforcement learning, 2025. URLhttps://arxiv.org/abs/2503.19470
-
[14]
arXiv preprint arXiv:2501.15228 , year=
Yiqun Chen, Lingyong Yan, Weiwei Sun, Xinyu Ma, Yi Zhang, Shuaiqiang Wang, Dawei Yin, Yiming Yang, and Jiaxin Mao. Improving retrieval-augmented generation through multi-agent reinforcement learning.arXiv preprint arXiv:2501.15228, 2025
-
[15]
Arc- agi-2: A new challenge for frontier ai reasoning systems
Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc-agi-2: A new challenge for frontier ai reasoning systems, 2026. URLhttps://arxiv.org/abs/2505.11831
-
[16]
Claw-eval: End-to-end transparent benchmark for ai agents in the real world.https://github.com/ claw-eval/claw-eval, 2026
claw-eval. Claw-eval: End-to-end transparent benchmark for ai agents in the real world.https://github.com/ claw-eval/claw-eval, 2026. GitHub repository
2026
-
[17]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URLhttps://arxiv.org/abs/2110.14168
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[18]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. CoRR, abs/2501.12948, 2025. 20
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[19]
Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, and Jingren Zhou. Self- play with execution feedback: Improving instruction-following capabilities of large language models.CoRR, abs/2406.13542, 2024. doi: 10.48550/ARXIV.2406.13542. URLhttps://doi.org/10.48550/arXiv.2406.13542
-
[20]
Guanting Dong, Yifei Chen, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Yutao Zhu, Hangyu Mao, Guorui Zhou, Zhicheng Dou, and Ji-Rong Wen. Tool-star: Empowering llm-brained multi-tool reasoner via reinforcement learning.CoRR, abs/2505.16410, 2025. doi: 10.48550/ARXIV.2505.16410. URLhttps://doi.org/10.48550/arXiv.2505.16410
-
[21]
arXiv preprint arXiv:2507.19849 , year=
Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, and Zhicheng Dou. Agentic reinforced policy optimization.CoRR, abs/2507.19849, 2025. doi: 10.48550/ARXIV.2507.19849. URLhttps: //doi.org/10.48550/arXiv.2507.19849
-
[22]
Toward generalized web agent training: A deep dive into entropy-balanced reinforcement learning
Guanting Dong, Licheng Bao, Zhongyuan Wang, Kangzhi Zhao, Xiaoxi Li, Jiajie Jin, Jinghan Yang, Hangyu Mao, Fuzheng Zhang, Kun Gai, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, and Zhicheng Dou. Toward generalized web agent training: A deep dive into entropy-balanced reinforcement learning. InProceedings of the ACM Web Conference 2026, WWW ’26, page 2126–2137, New...
-
[23]
Gemini 3 pro: the frontier of vision ai.https://blog.google/innovation-and-ai/technology/ developers-tools/gemini-3-pro-vision/, 2025
Rohan Doshi. Gemini 3 pro: the frontier of vision ai.https://blog.google/innovation-and-ai/technology/ developers-tools/gemini-3-pro-vision/, 2025
2025
-
[25]
Towards general agentic intelligence via environment scaling
Runnan Fang, Shihao Cai, Baixuan Li, Jialong Wu, Guangyu Li, Wenbiao Yin, Xinyu Wang, Xiaobin Wang, Liangcai Su, Zhen Zhang, et al. Towards general agentic intelligence via environment scaling.arXiv preprint arXiv:2509.13311, 2025
-
[26]
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. Retool: Reinforcement learning for strategic tool use in llms, 2025. URLhttps: //arxiv.org/abs/2504.11536
work page internal anchor Pith review arXiv 2025
-
[27]
Jichen Feng, Yifan Zhang, Chenggong Zhang, Yifu Lu, Shilong Liu, and Mengdi Wang. Web world models, 2025. URLhttps://arxiv.org/abs/2512.23676
-
[28]
Group-in-Group Policy Optimization for LLM Agent Training
Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training. arXiv preprint arXiv:2505.10978, 2025
work page internal anchor Pith review arXiv 2025
-
[29]
Large language model-based human-agent collaboration for complex task solving
Xueyang Feng, Zhi-Yuan Chen, Yujia Qin, Yankai Lin, Xu Chen, Zhiyuan Liu, and Ji-Rong Wen. Large language model-based human-agent collaboration for complex task solving. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 1336–1357, 2024
2024
-
[30]
Beyond ten turns: Unlocking long-horizon agentic search with large-scale asynchronous rl, 2025
Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, and Yi Wu. Beyond ten turns: Unlocking long-horizon agentic search with large-scale asynchronous rl, 2025. URLhttps://arxiv.org/ abs/2508.07976
-
[32]
Jiacheng Guo, Ling Yang, Peter Chen, Qixin Xiao, Yinjie Wang, Xinzhe Juan, Jiahao Qiu, Ke Shen, and Mengdi Wang. Genenv: Difficulty-aligned co-evolution between llm agents and environment simulators, 2025. URL https://arxiv.org/abs/2512.19682
-
[33]
Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems, 2024. URLhttps: //arxiv.org/abs/2402.14008
work page internal anchor Pith review arXiv 2024
-
[34]
Openclaw as language infrastructure: A case-centered survey of a public agent ecosystem in the wild
Chaoyue He, Xin Zhou, Di Wang, Hong Xu, Wei Liu, and Chunyan Miao. Openclaw as language infrastructure: A case-centered survey of a public agent ecosystem in the wild. 2026
2026
-
[35]
Measuring Massive Multitask Language Understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021. URLhttps://arxiv.org/abs/2009.03300. 21
work page internal anchor Pith review arXiv 2021
-
[36]
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset, 2021. URLhttps://arxiv.org/ abs/2103.03874
work page internal anchor Pith review arXiv 2021
-
[37]
Model context protocol (mcp): Landscape, security threats, and future research directions.ACM Transactionson Software Engineering and Methodology, 2025
Xinyi Hou, Yanjie Zhao, Shenao Wang, and Haoyu Wang. Model context protocol (mcp): Landscape, security threats, and future research directions.ACM Transactionson Software Engineering and Methodology, 2025
2025
-
[38]
Scaling environments for llm agents in the era of learning from interaction: A survey
Yuchen Huang, Sijia Li, Wei Liu, Zhiyuan Fan, Yi R Fung, et al. Scaling environments for llm agents in the era of learning from interaction: A survey. InWorkshopon Scaling Environments for Agents, 2025
2025
-
[39]
Reinforcement learning with rubric anchors
Zenan Huang, Yihong Zhuang, Guoshan Lu, Zeyu Qin, Haokai Xu, Tianyu Zhao, Ru Peng, Jiaqi Hu, Zhanming Shen, Xiaomeng Hu, Xijun Gu, Peiyi Tu, Jiaxin Liu, Wenyu Chen, Yuzhuo Fu, Zhiting Fan, Yanmei Gu, Yuanyuan Wang, Zhengkai Yang, Jianguo Li, and Junbo Zhao. Reinforcement learning with rubric anchors.CoRR, abs/2508.12790, 2025. doi: 10.48550/ARXIV.2508.127...
- [40]
-
[41]
Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024. URLhttps://arxiv.org/abs/2310. 06770
2024
-
[42]
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning.CoRR, abs/2503.09516, 2025
work page Pith review arXiv 2025
-
[43]
Decoupled planning and execution: A hierarchical reasoning framework for deep search, 2025
Jiajie Jin, Xiaoxi Li, Guanting Dong, Yuyao Zhang, Yutao Zhu, Yang Zhao, Hongjin Qian, and Zhicheng Dou. Decoupled planning and execution: A hierarchical reasoning framework for deep search, 2025. URL https://arxiv.org/abs/2507.02652
-
[44]
Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu, Xiaochen Wang, Rui Ge, Yuxuan Cao, Yuzhen Huang, Wei Liu, Junteng Liu, Zhaochen Su, Yiyang Guo, Fan Zhou, Lueyang Zhang, Juan Michelini, Xingyao Wang, Xiang Yue, Shuyan Zhou, Graham Neubig, and Junxian He. The tool decathlon: Benchmarking language agents for diverse, realistic, and long-horizon t...
-
[45]
Websailor: Navigating super-human reasoning for web agent.arXiv preprint arXiv:2507.02592, 2025
Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, and Jingren Zhou. Websailor: Navigating super-human reasoning for web agent.CoRR, abs/2507.02592, 2025. doi: 10.48550/ARXIV.2507.02592....
-
[46]
arXiv preprint arXiv:2508.13167 , year=
Weizhen Li, Jianbo Lin, Zhuosong Jiang, Jingyi Cao, Xinpeng Liu, Jiayu Zhang, Zhenqiang Huang, Qianben Chen, Weichen Sun, Qiexiang Wang, Hongxuan Lu, Tianrui Qin, Chenghao Zhu, Yi Yao, Shuying Fan, Xiaowan Li, Tiannan Wang, Pai Liu, King Zhu, He Zhu, Dingfeng Shi, Piaohong Wang, Yeyi Guan, Xiangru Tang, Minghao Liu, Yuchen Eleanor Jiang, Jian Yang, Jiahen...
-
[47]
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Binxu Li, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, Yijiang Li, Yueqian Lin, Xinyi Liu, X...
work page internal anchor Pith review arXiv 2026
-
[48]
arXiv preprint arXiv:2510.21618 , year=
Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Guanting Dong, Jiajie Jin, Yinuo Wang, Hao Wang, Yutao Zhu, Ji-Rong Wen, Yuan Lu, and Zhicheng Dou. Deepagent: A general reasoning agent with scalable toolsets, 2025. URL https://arxiv.org/abs/2510.21618
-
[49]
Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yutao Zhu, Yongkang Wu, Ji-Rong Wen, and Zhicheng Dou. Webthinker: Empowering large reasoning models with deep research capability.arXiv preprint arXiv:2504.21776, 2025
-
[50]
Omnigaia: Towards native omni-modal ai agents, 2026
Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Shijian Wang, Guanting Dong, Jiajie Jin, Hao Wang, Yinuo Wang, Ji-Rong Wen, Yuan Lu, and Zhicheng Dou. Omnigaia: Towards native omni-modal ai agents, 2026. URL https://arxiv.org/abs/2602.22897. 22
-
[51]
Torl: Scaling tool-integrated rl, 2025 b
Xuefeng Li, Haoyang Zou, and Pengfei Liu. Torl: Scaling tool-integrated RL.CoRR, abs/2503.23383, 2025. doi: 10.48550/ARXIV.2503.23383. URLhttps://doi.org/10.48550/arXiv.2503.23383
-
[52]
Yixia Li, Hongru Wang, Jiahao Qiu, Zhenfei Yin, Dongdong Zhang, Cheng Qian, Zeping Li, Pony Ma, Guanhua Chen, Heng Ji, and Mengdi Wang. From word to world: Can large language models be implicit text-based world models?, 2025. URLhttps://arxiv.org/abs/2512.18832
-
[53]
arXiv preprint arXiv:2508.17445 , year=
Yizhi Li, Qingshui Gu, Zhoufutu Wen, Ziniu Li, Tianshun Xing, Shuyue Guo, Tianyu Zheng, Xin Zhou, Xingwei Qu, Wangchunshu Zhou, et al. Treepo: Bridging the gap of policy optimization and efficacy and inference efficiency with heuristic tree-based modeling.arXiv preprint arXiv:2508.17445, 2025
-
[55]
Simulating environments with reasoning models for agent training.arXiv preprint arXiv:2511.01824,
Yuetai Li, Huseyin A Inan, Xiang Yue, Wei-Ning Chen, Lukas Wutschitz, Janardhan Kulkarni, Radha Poovendran, Robert Sim, and Saravan Rajmohan. Simulating environments with reasoning models for agent training, 2025. URLhttps://arxiv.org/abs/2511.01824
-
[56]
SkillNet: Create, evaluate, and connect AI skills,
Yuan Liang, Ruobin Zhong, Haoming Xu, Chen Jiang, Yi Zhong, Runnan Fang, Jia-Chen Gu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Xin Xu, Tongtong Wu, Kun Wang, Yang Liu, Zhen Bi, Jungang Lou, Yuchen Eleanor Jiang, Hangcheng Zhu, Gang Yu, Haiwen Hong, Longtao Huang, Hui Xue, Chenxi Wang, Yijun Wang, Zifei Shan, Xi Chen, Zhaopeng Tu, Feiyu Xiong, X...
-
[57]
Let’s verify step by step, 2023
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step, 2023. URLhttps://arxiv.org/abs/2305. 20050
2023
-
[58]
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3. 2: Pushing the frontier of open large language models.arXiv preprint arXiv:2512.02556, 2025
work page internal anchor Pith review arXiv 2025
-
[59]
Harness engineering: leveraging codex in an agent-first world.https://openai.com/index/ harness-engineering/, feb 2026
Ryan Lopopolo. Harness engineering: leveraging codex in an agent-first world.https://openai.com/index/ harness-engineering/, feb 2026. OpenAI Engineering Blog. Accessed: 2026-04-06
2026
- [60]
-
[61]
Personax: A recommendation agent-oriented user modeling framework for long behavior sequence
Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Haoping Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, and Ruoming Pang. ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Findings of the Association for Computational ...
-
[62]
Ziyang Luo, Zhiqi Shen, Wenzhuo Yang, Zirui Zhao, Prathyusha Jwalapuram, Amrita Saha, Doyen Sahoo, Silvio Savarese, Caiming Xiong, and Junnan Li. Mcp-universe: Benchmarking large language models with real-world model context protocol servers.arXiv preprint arXiv:2508.14704, 2025
-
[63]
arXiv preprint arXiv:2410.06526 , year=
Kaijing Ma, Xinrun Du, Yunran Wang, Haoran Zhang, Zhoufutu Wen, Xingwei Qu, Jian Yang, Jiaheng Liu, Minghao Liu, Xiang Yue, Wenhao Huang, and Ge Zhang. Kor-bench: Benchmarking language models on knowledge-orthogonal reasoning tasks, 2025. URLhttps://arxiv.org/abs/2410.06526
-
[64]
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, An...
work page internal anchor Pith review arXiv 2026
-
[65]
GAIA: a benchmark for general AI assistants
Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for general AI assistants. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URLhttps://openreview.net/forum?id=fibxvahvs3
2024
-
[66]
Model context protocol specification
Model Context Protocol. Model context protocol specification. https://modelcontextprotocol.io/ specification/latest, 2025. Accessed: 2026-04-06
2025
-
[67]
Yutao Mou, Zhangchi Xue, Lijun Li, Peiyang Liu, Shikun Zhang, Wei Ye, and Jing Shao. Toolsafe: Enhancing tool invocation safety of llm-based agents via proactive step-level guardrail and feedback.arXiv preprint arXiv:2601.10156, 2026
-
[68]
Learning to reason with llms, September 2024
OpenAI. Learning to reason with llms, September 2024. URL https://openai.com/index/ learning-to-reason-with-llms
2024
-
[69]
Introducing gpt-5.2.https://openai.com/zh-Hans-CN/index/introducing-gpt-5-2/, 2025
OpenAI. Introducing gpt-5.2.https://openai.com/zh-Hans-CN/index/introducing-gpt-5-2/, 2025
2025
-
[70]
OpenAI, :, Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Haiming Bao, Boaz Barak, Ally Bennett, Tyler Bertao, Nivedita Brett, Eugene Brevdo, Greg Brockman, Sebastien Bubeck, Che Chang, Kai Chen, Mark Chen, Enoch Cheung, Aidan Clark, Dan Cook, Marat Dukhan, Casey Dvorak, Kevin Fives, V...
work page internal anchor Pith review arXiv 2025
-
[71]
Natural-Language Agent Harnesses
Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, and Hai-Tao Zheng. Natural-language agent harnesses, 2026. URLhttps://arxiv.org/abs/2603.25723
-
[72]
Generative Agents: Interactive Simulacra of Human Behavior
Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior, 2023. URLhttps://arxiv.org/abs/2304.03442
work page internal anchor Pith review arXiv 2023
-
[73]
Gonzalez
Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. The berkeley function calling leaderboard (BFCL): From tool use to agentic evaluation of large language models. In Forty-secondInternational Conference on Machine Learning, 2025. URLhttps://openreview.net/ forum?id=2GmDdhBdDk
2025
-
[74]
Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, et al. Humanity’s last exam.arXiv preprint arXiv:2501.14249, 2025
work page internal anchor Pith review arXiv 2025
-
[75]
Apigen-mt: Agentic pipeline for multi-turn data generation via simulated agent-human interplay
Akshara Prabhakar, Zuxin Liu, Ming Zhu, Jianguo Zhang, Tulika Awalgaonkar, Shiyu Wang, Zhiwei Liu, Haolin Chen, Thai Hoang, Juan Carlos Niebles, Shelby Heinecke, Weiran Yao, Huan Wang, Silvio Savarese, and Caiming Xiong. Apigen-mt: Agentic pipeline for multi-turn data generation via simulated agent-human interplay, 2025. URLhttps://arxiv.org/abs/2504.03601
-
[76]
ToolRL: Reward is All Tool Learning Needs
Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. Toolrl: Reward is all tool learning needs.arXiv preprint arXiv:2504.13958, 2025. 24
work page internal anchor Pith review arXiv 2025
-
[77]
V-oracle: Making progressive reasoning in deciphering oracle bones for you and me
Runqi Qiao, Qiuna Tan, Guanting Dong, MinhuiWu MinhuiWu, Jiapeng Wang, Yifan Zhang, Zhuoma GongQue, Chong Sun, Yida Xu, Yadong Xue, et al. V-oracle: Making progressive reasoning in deciphering oracle bones for you and me. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume1: Long Papers), pages 20124–20150, 2025
2025
-
[78]
Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Jiapeng Wang, Zhuoma Gongque, ShanglinLei, YifanZhang, ZheWei, MiaoxuanZhang, RunfengQiao, XiaoZong, YidaXu, PeiqingYang, Zhimin Bao, Muxi Diao, Chen Li, and Honggang Zhang. We-math: Does your large multimodal model achieve human-likemathematicalreasoning? InWanxiangChe, JoyceNabe...
2025
-
[79]
Runqi Qiao, Qiuna Tan, Peiqing Yang, Yanzi Wang, Xiaowan Wang, Enhui Wan, Sitong Zhou, Guanting Dong, Yuchen Zeng, Yida Xu, Jie Wang, Chong Sun, Chen Li, and Honggang Zhang. We-math 2.0: A versatile mathbook system for incentivizing visual mathematical reasoning.CoRR, abs/2508.10433, 2025. doi: 10.48550/ARXIV.2508.10433. URLhttps://doi.org/10.48550/arXiv....
-
[80]
Simworld: An open-ended realistic simulator for autonomous agents in physical and social worlds,
Jiawei Ren, Yan Zhuang, Xiaokang Ye, Lingjun Mao, Xuhong He, Jianzhi Shen, Mrinaal Dogra, Yiming Liang, Ruixuan Zhang, Tianai Yue, Yiqing Yang, Eric Liu, Ryan Wu, Kevin Benavente, Rajiv Mandya Nagaraju, Muhammad Faayez, Xiyan Zhang, Dhruv Vivek Sharma, Xianrui Zhong, Ziqiao Ma, Tianmin Shu, Zhiting Hu, and Lianhui Qin. Simworld: An open-ended realistic si...
-
[81]
Bytedance Seed. Seed2. 0 model card: Towards intelligence frontier for real-world complexity. Technical report, Technical report, Bytedance, 2025. URL https://lf3-static. bytednsdoc. com ..., 2026
2025
-
[82]
Seed1.8 Model Card: Towards Generalized Real-World Agency
Bytedance Seed. Seed1.8 model card: Towards generalized real-world agency, 2026. URLhttps://arxiv.org/ abs/2603.20633
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[83]
Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G. Finlayson, David A. Sontag, Tyler Murray, Sewon Min, Pradeep Dasigi, Luca Soldaini, Faeze Brahman, Wen-tau Yih, Tongshuang Wu, Luke Zettlemoyer, Yoon Kim, Hannaneh Hajishirzi, and Pang Wei Koh. DR tulu: Reinforcement learning with...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.