UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
Pith reviewed 2026-05-13 10:08 UTC · model grok-4.3
The pith
UI-TARS-2 reaches 88.2 on Online-Mind2Web and a 59.8 mean normalized game score by training a native GUI agent with multi-turn reinforcement learning and a data flywheel.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UI-TARS-2 achieves 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, 73.3 on AndroidWorld, and a mean normalized score of 59.8 across a 15-game suite by applying a data flywheel, a stabilized multi-turn RL framework, a hybrid GUI environment, and a unified sandbox; the same system also shows competitive results with frontier models on LMGame-Bench and extends to information-seeking and software-engineering benchmarks.
What carries the argument
The stabilized multi-turn reinforcement learning framework together with a data flywheel for scalable data generation and a hybrid GUI environment that adds file-system and terminal access inside a unified sandbox.
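To make the framework concrete, here is a minimal sketch, under stated assumptions, of the rollout loop a stabilized multi-turn RL setup implies: a policy alternates observations and actions inside a sandboxed environment, and finished trajectories feed both the policy update and the data flywheel. The environment API and every name below are hypothetical, not the paper's code.

from dataclasses import dataclass, field

@dataclass
class Turn:
    observation: object   # e.g. a screenshot or accessibility tree
    action: str           # a GUI click/type, or a file-system/terminal command
    reward: float

@dataclass
class Trajectory:
    turns: list = field(default_factory=list)

def rollout(policy, env, max_turns=50):
    """Collect one multi-turn trajectory from a sandboxed environment."""
    traj = Trajectory()
    obs = env.reset()
    for _ in range(max_turns):
        action = policy.act(obs)                   # model proposes the next action
        next_obs, reward, done = env.step(action)  # sandbox executes it
        traj.turns.append(Turn(obs, action, reward))
        obs = next_obs
        if done:
            break
    return traj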
If this is right
- Outperforms Claude and OpenAI agents on multiple GUI benchmarks while remaining competitive with OpenAI o3 on game suites.
- Generalizes to long-horizon information-seeking tasks and software-engineering benchmarks without task-specific retraining.
- Yields training-dynamics insights that support stable and efficient large-scale agent reinforcement learning.
- Maintains roughly 60 percent of human-level performance across the 15-game evaluation suite.
Where Pith is reading between the lines
- The hybrid environment may be the main factor that allows training signals from file and terminal actions to improve pure GUI performance.
- Continued scaling of the data flywheel could close more of the remaining gap to human performance on long-horizon tasks.
- The same combination of multi-turn RL and sandbox rollouts might transfer to non-GUI agent settings such as web navigation or code execution agents.
Load-bearing premise
The hybrid GUI environment and unified sandbox produce training and evaluation conditions that are stable and representative enough for the observed gains to transfer to real-world interactive scenarios outside the controlled benchmarks.
What would settle it
Running UI-TARS-2 on a fresh collection of real desktop and mobile tasks that lie outside the provided benchmarks and sandbox and measuring whether the reported score margins over prior models and proprietary agents are preserved.
Original abstract
The development of autonomous agents for graphical user interfaces (GUIs) presents major challenges in artificial intelligence. While recent advances in native agent models have shown promise by unifying perception, reasoning, action, and memory through end-to-end learning, open problems remain in data scalability, multi-turn reinforcement learning (RL), the limitations of GUI-only operation, and environment stability. In this technical report, we present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology: a data flywheel for scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI environment that integrates file systems and terminals, and a unified sandbox platform for large-scale rollouts. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5. On GUI benchmarks, it reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld, outperforming strong baselines such as Claude and OpenAI agents. In game environments, it attains a mean normalized score of 59.8 across a 15-game suite (roughly 60% of human-level performance) and remains competitive with frontier proprietary models (e.g., OpenAI o3) on LMGame-Bench. Additionally, the model can generalize to long-horizon information-seeking tasks and software engineering benchmarks, highlighting its robustness across diverse agent tasks. Detailed analyses of training dynamics further provide insights into achieving stability and efficiency in large-scale agent RL. These results underscore UI-TARS-2's potential to advance the state of GUI agents and exhibit strong generalization to real-world interactive scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents UI-TARS-2, a native GUI-centered agent model trained via a data flywheel for scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI environment integrating file systems and terminals, and a unified sandbox for large-scale rollouts. It reports substantial gains over UI-TARS-1.5, including 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, 73.3 on AndroidWorld, and a mean normalized score of 59.8 across a 15-game suite, while claiming outperformance over baselines such as Claude and OpenAI agents and generalization to long-horizon and software engineering tasks.
Significance. If the performance gains prove attributable to the multi-turn RL stabilization and data flywheel rather than the expanded hybrid action space, the work would offer a useful empirical advance in scalable GUI agent training, providing benchmark numbers that can serve as reference points for future native agent models. The multi-benchmark evaluation and training dynamics analysis add value, though the hybrid setup's role requires clarification to support claims of GUI-centric generalization.
Major comments (3)
- [Hybrid GUI environment and unified sandbox] Hybrid GUI environment description: the central claim of advancing GUI agents rests on the hybrid integration of file systems and terminals, yet no ablation isolates their contribution from pure GUI actions; without this, the reported deltas (e.g., 88.2 on Online-Mind2Web, 47.5 on OSWorld) may reflect richer action spaces rather than improved perception-reasoning loops, weakening the generalization assertion to real-world GUI scenarios.
- [Empirical evaluation] Empirical evaluation and results: benchmark scores are presented without error bars, standard deviations, number of evaluation runs, or data-exclusion criteria; this absence makes it impossible to assess statistical reliability of outperformance claims over UI-TARS-1.5 and proprietary baselines.
- [Training methodology] Multi-turn RL framework: the stabilized multi-turn RL is positioned as a core methodological advance, but the manuscript supplies no ablation on its components (e.g., reward shaping or turn-length handling) or concrete hyperparameters, leaving the source of training stability and the 59.8 game-suite score opaque.
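As a concrete example of what such an ablation could isolate: turn-level generalized advantage estimation with batch normalization of advantages is one standard stabilizer for long multi-turn episodes. The sketch below is a generic illustration under a PPO-style training assumption, not the report's actual method, and every name in it is ours.

import numpy as np

def turn_level_advantages(rewards, values, gamma=1.0, lam=0.95):
    """GAE computed over the turns of one trajectory, then normalized.
    A generic stabilizer an ablation could switch off; not from UI-TARS-2."""
    adv = np.zeros(len(rewards))
    last = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0  # terminal bootstrap of 0
        delta = rewards[t] + gamma * next_value - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    # batch-normalizing advantages damps the variance long horizons introduce
    return (adv - adv.mean()) / (adv.std() + 1e-8)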
Minor comments (2)
- A consolidated table comparing all reported benchmarks against baselines (including UI-TARS-1.5, Claude, and OpenAI agents) would improve readability of the performance claims.
- The game-environment normalization procedure and the exact composition of the 15-game suite should be specified to allow direct replication of the 59.8 mean score.
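Pending that specification, a plausible reading is the human-normalized convention common in the game-playing literature, where 0 corresponds to random play and 100 to human-level performance; the sketch below assumes that convention, which the report does not confirm.

def normalized_score(agent, random_baseline, human):
    """Map a raw game score to 0 at random play and 100 at human level.
    Assumes the Atari-style convention; the report's exact formula is unspecified."""
    return 100.0 * (agent - random_baseline) / (human - random_baseline)

def mean_normalized(per_game):
    """per_game: iterable of (agent, random_baseline, human) triples, one per game."""
    scores = [normalized_score(a, r, h) for a, r, h in per_game]
    return sum(scores) / len(scores)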
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our technical report. We address each major point below with honest responses based on the current manuscript. Where the comments identify gaps, we commit to revisions that strengthen the paper without overstating what was originally presented.
Point-by-point responses
Referee: Hybrid GUI environment description: the central claim of advancing GUI agents rests on the hybrid integration of file systems and terminals, yet no ablation isolates their contribution from pure GUI actions; without this, the reported deltas (e.g., 88.2 on Online-Mind2Web, 47.5 on OSWorld) may reflect richer action spaces rather than improved perception-reasoning loops, weakening the generalization assertion to real-world GUI scenarios.
Authors: We agree that the manuscript does not contain an explicit ablation isolating the hybrid file-system and terminal components from pure GUI actions. The hybrid environment is presented as an integrated part of the unified sandbox to support realistic long-horizon tasks that require non-GUI operations, which aligns with our generalization claims. However, without dedicated ablations, attribution of the performance deltas remains correlational rather than causal. We will revise the manuscript to include a dedicated limitations paragraph and, where feasible, preliminary comparative runs that clarify the incremental value of the hybrid actions. revision: partial
Referee: Empirical evaluation and results: benchmark scores are presented without error bars, standard deviations, number of evaluation runs, or data-exclusion criteria; this absence makes it impossible to assess statistical reliability of outperformance claims over UI-TARS-1.5 and proprietary baselines.
Authors: The referee is correct that the initial submission omitted error bars, standard deviations, run counts, and exclusion criteria. These statistics were collected during evaluation but not reported. We will add them to all main benchmark tables in the revision, along with explicit statements on the number of independent runs and any data filtering applied, to allow proper assessment of statistical reliability (the kind of per-run summary we have in mind is sketched after these responses). revision: yes
Referee: Multi-turn RL framework: the stabilized multi-turn RL is positioned as a core methodological advance, but the manuscript supplies no ablation on its components (e.g., reward shaping or turn-length handling) or concrete hyperparameters, leaving the source of training stability and the 59.8 game-suite score opaque.
Authors: We acknowledge that the manuscript describes the stabilized multi-turn RL framework at a high level but does not provide component ablations (e.g., on reward shaping or turn-length handling) or a full hyperparameter table. These details exist in our internal training logs but were not included in the submitted version. We will add both the requested ablations and a comprehensive hyperparameter appendix in the revision to make the sources of stability and the 59.8 score transparent. revision: yes
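To make the reporting commitment above concrete, the per-benchmark summary could take roughly the following shape: mean, sample standard deviation, and a normal-approximation 95% confidence interval over independent evaluation runs. The helper and the example scores are hypothetical.

import math
import statistics

def summarize_runs(scores):
    """Mean, sample std, and a normal-approximation 95% CI over
    independent evaluation runs of one benchmark."""
    n = len(scores)
    mean = statistics.fmean(scores)
    if n < 2:
        return mean, 0.0, (mean, mean)
    std = statistics.stdev(scores)
    half = 1.96 * std / math.sqrt(n)
    return mean, std, (mean - half, mean + half)

# e.g. summarize_runs([47.1, 47.8, 47.6]) for three hypothetical OSWorld runs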
Circularity Check
No circularity: empirical benchmark results are independent of internal definitions
Full rationale
The paper presents a training methodology (data flywheel, multi-turn RL, hybrid environment, unified sandbox) and reports performance numbers on external public benchmarks (Online-Mind2Web 88.2, OSWorld 47.5, etc.). No equations, fitted parameters, or self-citations are shown that reduce these scores to quantities defined inside the training loop by construction. The central claims rest on measured deltas against independent baselines rather than renaming or self-referential derivations, satisfying the self-contained criterion.
Axiom & Free-Parameter Ledger
Free parameters (1)
- Multi-turn RL hyperparameters
Axioms (1)
- Domain assumption: benchmark scores on Online-Mind2Web, OSWorld, WindowsAgentArena, AndroidWorld, and the 15-game suite accurately reflect real-world GUI agent performance.
Forward citations
Cited by 26 Pith papers
- S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images. S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
- Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment. BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.
- Covering Human Action Space for Computer Use: Data Synthesis and Benchmark. Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
- Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games? VLATIM benchmark reveals large VLMs excel at high-level planning in physics puzzles but struggle with precise visual grounding and mouse control, so they lack human-like problem-solving capabilities.
- ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction. ReVision reduces visual token usage by 46% on average in agent trajectories via a learned patch selector and improves success rates by 3% on three benchmarks, showing that history saturation stems from inefficient rep...
- Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization. An exploration-aware RL framework lets LLM agents adaptively explore only under high uncertainty via variational rewards and action grouping, yielding consistent gains on text and GUI agent benchmarks.
- Weblica: Scalable and Reproducible Training Environments for Visual Web Agents. Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...
- Faithful Mobile GUI Agents with Guided Advantage Estimator. Faithful-Agent raises Trap SR in GUI agents from 13.88% to 80.21% via faithfulness-oriented SFT and GuAE-enhanced RFT with consistency rewards while retaining general performance.
- RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management. RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.
- MolmoWeb: Open Visual Web Agent and Open Data for the Open Web. Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...
- ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents. ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
- ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction. ReVision reduces visual tokens in computer-use agent histories by 46% on average and raises success rates by 3% by learning to drop redundant patches across screenshots, allowing longer histories to keep improving per...
- How Mobile World Model Guides GUI Agents? Mobile world models in text, image, and code modalities reach state-of-the-art on their benchmarks and improve downstream GUI agent performance, with code best for in-distribution accuracy and text more robust for out...
- SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents. SnapGuard detects prompt injection attacks on screenshot-based web agents via visual stability indicators and contrast-polarity textual signals, reaching F1 0.75 while running 8x faster than GPT-4o with no added memory cost.
- VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation. VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
- AgentLens: Adaptive Visual Modalities for Human-Agent Interaction in Mobile GUI Agents. AgentLens adaptively deploys Full UI, Partial UI, and GenUI modalities with virtual display overlays for mobile GUI agents, yielding 85.7% user preference and best-in-study usability in a 21-participant evaluation.
- Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents. Closed-loop VLM agents using multi-view reasoning, object-centered visualization, and single-axis rotation prediction achieve superior text-guided 6D pose rearrangement for target objects in scenes.
- Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection. Semantic-level UI Element Injection distracts GUI agents by overlaying safety-aligned UI elements, achieving up to 4.4x higher attack success rates that transfer across models and create persistent attractors.
- Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse. The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.
- Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization. An exploration-aware policy optimization method lets LLM agents explore selectively via a variational-inference reward and action grouping, yielding consistent gains on text and GUI agent benchmarks.
- On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length. Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.
- GUI Agents with Reinforcement Learning: Toward Digital Inhabitants. The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...
- HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents. HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.
- Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability. The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...
- Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward. The paper surveys agent skills for LLMs across architecture, acquisition, deployment, and security, proposing a four-tier Skill Trust and Lifecycle Governance Framework to address vulnerabilities in community skills.
- Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse. This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.