InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction
Pith reviewed 2026-05-22 15:14 UTC · model grok-4.3
The pith
InfantAgent-Next reaches 7.27 percent accuracy on OSWorld by letting tool and vision agents collaborate in a modular setup.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
InfantAgent-Next achieves 7.27 percent accuracy on OSWorld, higher than Claude-Computer-Use, by integrating tool-based and pure vision agents within a highly modular architecture that enables different models to collaboratively solve decoupled tasks in a step-by-step manner. The same architecture supports evaluation on GAIA and SWE-Bench to demonstrate broader applicability across vision-based and tool-intensive computer interaction benchmarks.
What carries the argument
The highly modular architecture that integrates tool-based agents and pure vision agents to allow collaborative, step-by-step solving of decoupled tasks.
If this is right
- The agent can be evaluated on pure vision-based real-world benchmarks such as OSWorld.
- It can also be evaluated on more general or tool-intensive benchmarks, including GAIA and SWE-Bench.
- Different models can be plugged into the architecture to handle specific aspects of the overall task.
- The open-sourced code and evaluation scripts allow direct replication and extension to new benchmarks.
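The plug-in idea in the list above can be illustrated with a minimal sketch. Everything in it is hypothetical: the `Step` type, the `run_steps` function, and the two stand-in agents are illustrative names chosen here, not the paper's API, and the real system may route steps quite differently.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    """One decoupled sub-task (hypothetical structure, not from the paper)."""
    kind: str     # assumed labels: "tool" or "vision"
    payload: str  # free-text description of what the step should do


# An agent is just a callable here; any model could sit behind it.
Agent = Callable[[str], str]


def tool_agent(payload: str) -> str:
    # Stand-in for a tool-based agent (e.g. file edits, code execution).
    return f"tool:{payload}"


def vision_agent(payload: str) -> str:
    # Stand-in for a pure-vision agent (e.g. locating a button on screen).
    return f"vision:{payload}"


def run_steps(steps: List[Step], agents: Dict[str, Agent]) -> List[str]:
    """Send each step to the agent registered for its kind, in order."""
    results = []
    for step in steps:
        if step.kind not in agents:
            raise KeyError(f"no agent registered for kind {step.kind!r}")
        results.append(agents[step.kind](step.payload))
    return results


agents = {"tool": tool_agent, "vision": vision_agent}
plan = [Step("tool", "open report.csv"), Step("vision", "click Save button")]
outputs = run_steps(plan, agents)
```

Swapping a model then amounts to replacing one entry in `agents`, which is the property the modularity claim relies on.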
Where Pith is reading between the lines
- The same splitting principle might reduce the size of any single model required for complex computer tasks by distributing perception and action across specialized components.
- Similar modularity could be tested in non-computer domains that mix visual observation with tool use, such as robotic manipulation or document processing.
- Future measurements could track whether coordination between the two agent types adds latency or error accumulation as task length increases.
Load-bearing premise
The modular split between tool-based and vision agents produces reliable step-by-step collaboration without hidden coordination costs or benchmark-specific tuning that would invalidate cross-benchmark comparisons.
What would settle it
A controlled test on OSWorld that runs the same underlying models with and without the modular split would settle it: if the split version does not exceed the single-model baselines, the architecture does not deliver the claimed benefit.
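One way such a controlled comparison could be scored is with a paired test on per-task outcomes. The sketch below uses an exact McNemar test, which is a choice made here and not something the paper or the review specifies; the function name and the input format (one boolean per task, same task order in both lists) are assumptions.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(split_ok: Sequence[bool], single_ok: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired per-task success flags."""
    if len(split_ok) != len(single_ok):
        raise ValueError("both runs must cover the same tasks")
    # Only tasks where the two variants disagree carry information.
    b = sum(s and not u for s, u in zip(split_ok, single_ok))  # split only
    c = sum(u and not s for s, u in zip(split_ok, single_ok))  # single only
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence either way
    k = min(b, c)
    # Binomial(n, 0.5) probability of a result at least this lopsided.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A small p-value with more split-only than single-only successes would favour the modular version; a large one would mean the subset cannot separate the two variants.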
Original abstract
This paper introduces InfantAgent-Next, a generalist agent capable of interacting with computers in a multimodal manner, encompassing text, images, audio, and video. Unlike existing approaches that either build intricate workflows around a single large model or only provide workflow modularity, our agent integrates tool-based and pure vision agents within a highly modular architecture, enabling different models to collaboratively solve decoupled tasks in a step-by-step manner. Our generality is demonstrated by our ability to evaluate not only pure vision-based real-world benchmarks (i.e., OSWorld), but also more general or tool-intensive benchmarks (e.g., GAIA and SWE-Bench). Specifically, we achieve 7.27% accuracy on OSWorld, higher than Claude-Computer-Use. Codes and evaluation scripts are open-sourced at https://github.com/bin123apple/InfantAgent.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces InfantAgent-Next, a multimodal generalist agent for automated computer interaction that integrates tool-based and pure-vision agents within a highly modular architecture. This design is claimed to enable different models to collaboratively solve decoupled tasks in a step-by-step manner. The approach is evaluated on OSWorld (reporting 7.27% accuracy, exceeding Claude-Computer-Use), as well as GAIA and SWE-Bench, with code and evaluation scripts open-sourced.
Significance. If the modular collaboration demonstrably improves performance beyond single-agent baselines or model selection effects, the work could advance generalist agents by showing how decoupled tool and vision components can be orchestrated without heavy workflow engineering. The open-sourcing of code and multi-benchmark evaluation are positive for reproducibility and generality claims.
major comments (2)
- [Experiments / Evaluation] The central claim attributes the 7.27% OSWorld accuracy to the modular integration of tool-based and pure-vision agents enabling reliable step-by-step collaboration. However, the manuscript provides no ablation studies that disable inter-agent handoff, force a unified model, or compare against single-agent variants on the same OSWorld subset. Without these controls, the performance gain cannot be isolated from component model choice or prompting.
- [Experiments] § on OSWorld results: the reported accuracy lacks error bars, detailed exclusion rules for task instances, or a full experimental protocol (e.g., number of runs, temperature settings, or failure mode categorization). This makes it difficult to assess whether the result is robust or benchmark-specific.
minor comments (2)
- [Architecture] The abstract and evaluation sections could more explicitly define the coordination protocol between tool-based and vision agents (e.g., message passing format or decision criteria for handoff).
- [Abstract] Minor notation inconsistency: the paper uses both 'InfantAgent-Next' and 'InfantAgent' in the abstract; standardize throughout.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments point by point below and will incorporate additional experimental details and analyses in the revised manuscript to better substantiate the claims.
- Referee: [Experiments / Evaluation] The central claim attributes the 7.27% OSWorld accuracy to the modular integration of tool-based and pure-vision agents enabling reliable step-by-step collaboration. However, the manuscript provides no ablation studies that disable inter-agent handoff, force a unified model, or compare against single-agent variants on the same OSWorld subset. Without these controls, the performance gain cannot be isolated from component model choice or prompting.
  Authors: We agree that explicit ablations would more rigorously isolate the contribution of the modular handoff mechanism. The manuscript does compare against Claude-Computer-Use, which operates as a single-model agent without the described tool-vision decoupling, and reports higher accuracy on OSWorld while also showing results on GAIA and SWE-Bench. However, we did not run dedicated single-model or no-handoff variants on the identical OSWorld task subset. In the revision we will add these controls: a unified-model baseline using the same component models without inter-agent collaboration, and a version that disables handoff, to clarify whether the step-by-step modular orchestration provides gains beyond model selection or prompting.
  Revision: yes
- Referee: [Experiments] § on OSWorld results: the reported accuracy lacks error bars, detailed exclusion rules for task instances, or a full experimental protocol (e.g., number of runs, temperature settings, or failure mode categorization). This makes it difficult to assess whether the result is robust or benchmark-specific.
  Authors: We acknowledge that the experimental reporting was insufficiently detailed. The original manuscript presented the headline accuracy but omitted variance measures and protocol specifics. In the revised version we will add error bars computed over multiple runs, state the number of runs and temperature settings (e.g., 0 for deterministic decoding), provide explicit exclusion criteria for task instances, and include a failure-mode breakdown. The complete protocol will be documented in an expanded appendix to support reproducibility and robustness evaluation.
  Revision: yes
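The error bars promised in the rebuttal could be computed along the following lines. This is a minimal sketch assuming one accuracy value per independent run; the function name is hypothetical, and the numbers in the usage line are placeholders chosen for easy arithmetic, not results from the paper.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def accuracy_with_error(run_accuracies: Sequence[float]) -> Tuple[float, float]:
    """Return (mean accuracy, standard error of the mean) across runs."""
    if len(run_accuracies) < 2:
        raise ValueError("need at least two runs to estimate variance")
    m = mean(run_accuracies)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    se = stdev(run_accuracies) / sqrt(len(run_accuracies))
    return m, se


# Placeholder values only, NOT results reported by the authors.
m, se = accuracy_with_error([0.10, 0.12, 0.14])
```

Reporting the mean together with the standard error and the run count lets readers judge whether a gap to a baseline exceeds run-to-run noise.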
Circularity Check
No circularity: results are direct empirical evaluations on public benchmarks
The manuscript presents an agent architecture and reports accuracy figures (e.g., 7.27% on OSWorld) obtained via direct evaluation on external public benchmarks. No equations, fitted parameters, or first-principles derivations are described that would reduce to the inputs by construction. The modular integration of tool-based and vision agents is an engineering choice whose contribution is asserted through benchmark outcomes rather than any self-referential definition or self-citation chain that bears the central load. This is a standard empirical systems paper whose claims remain independent of the circularity patterns enumerated.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Modular decomposition of tasks between tool and vision agents enables effective collaboration without prohibitive integration overhead.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "integrates tool-based and pure vision agents within a highly modular architecture, enabling different models to collaboratively solve decoupled tasks in a step-by-step manner"
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "Iterative Region Cropping and Mouse Click Logic (Algorithm 1)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.