pith. machine review for the scientific record.

arxiv: 2604.01676 · v2 · submitted 2026-04-02 · 💻 cs.CV · cs.AI · cs.SE

Recognition: 1 theorem link

· Lean Theorem

GPA: Learning GUI Process Automation from Demonstrations

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 21:32 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.SE
keywords GUI automation · vision-based RPA · single demonstration · Sequential Monte Carlo · readiness calibration · local execution · process replay

The pith

GPA replays GUI processes from one demonstration using vision-based localization for reliable, fast, local automation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GPA as a lightweight vision-based system for automating GUI tasks that learns directly from a single user demonstration. It establishes that Sequential Monte Carlo localization handles visual uncertainties like rescaling while readiness calibration enforces deterministic execution, yielding both higher success rates and ten times faster performance than Gemini 3 Pro with CUA tools on long-horizon tasks. The approach runs entirely locally, addressing the fragility of scripted RPA and the nondeterminism of large-model agents. This matters for enterprise settings that require repeatable, private automation without repeated human intervention or cloud dependency.

Core claim

GPA is a vision-based Robotic Process Automation method that replays GUI processes from a single demonstration. It achieves robustness through Sequential Monte Carlo-based localization that accounts for rescaling and detection uncertainty, and it achieves determinism through readiness calibration. The resulting system runs fully locally and completes long-horizon tasks with higher success and roughly ten times the speed of current vision-language model agents equipped with computer-use tools.
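The claim names the mechanism but not the algorithm. As a reading aid, here is a minimal particle-filter sketch of SMC localization over an element's (x, y, scale) state, assuming the demonstration supplies a positional prior and that a visual template-match score serves as the observation likelihood. The 200-particle count and 0.5N resampling threshold echo the simulated rebuttal below; every other parameter and name is illustrative, not the paper's implementation.

```python
import numpy as np

def smc_localize(observe, prior_xy, n_particles=200, n_steps=25, seed=0):
    """Toy SMC localizer for a GUI element.

    `observe(x, y, s)` returns a match score (a real system would use a
    visual template-matching likelihood; here it is a caller-supplied
    stand-in). State per particle: (x, y, scale). The prior is the click
    location recorded in the demonstration; all parameters are illustrative.
    """
    rng = np.random.default_rng(seed)
    p = np.column_stack([
        rng.normal(prior_xy[0], 150.0, n_particles),   # x near demo location
        rng.normal(prior_xy[1], 150.0, n_particles),   # y near demo location
        rng.uniform(0.7, 1.3, n_particles),            # unknown UI rescale
    ])
    w = np.full(n_particles, 1.0 / n_particles)
    for _ in range(n_steps):
        p += rng.normal(0.0, [10.0, 10.0, 0.02], p.shape)    # random-walk predict
        w *= np.array([observe(*row) for row in p]) + 1e-12  # reweight by likelihood
        w /= w.sum()
        if 1.0 / np.sum(w ** 2) < 0.5 * n_particles:         # ESS below 0.5N
            idx = rng.choice(n_particles, n_particles, p=w)  # systematic-ish resample
            p, w = p[idx], np.full(n_particles, 1.0 / n_particles)
    return np.average(p, weights=w, axis=0)   # posterior-mean (x, y, scale)
```

With a synthetic match score peaked at a rescaled target, the filter concentrates particles near the true location despite a deliberately offset prior, which is the behavior the robustness claim rests on.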

What carries the argument

Sequential Monte Carlo-based localization paired with readiness calibration, which together convert a single demonstration into a repeatable, uncertainty-tolerant execution trace.

If this is right

  • Other agents with coding ability can call GPA as an MCP or CLI tool, delegating execution while retaining only high-level reasoning.
  • Enterprise workflows gain a repeatable automation layer that avoids both brittle scripts and nondeterministic model outputs.
  • Full local execution removes the need to transmit screen content to external services, preserving privacy for sensitive processes.
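The first bullet's delegation pattern can be sketched with the split made explicit: a planner owns the step sequence, while a GPA-style executor owns each GUI action. Here `execute_step` stands in for a call into GPA over MCP or a CLI; the paper specifies no such interface, so every name below is hypothetical.

```python
from typing import Callable, Dict, List

def orchestrate(plan: List[str],
                execute_step: Callable[[str], bool],
                max_retries: int = 1) -> Dict[str, object]:
    """Hypothetical hybrid loop: a reasoning agent decides the step
    sequence, and a GPA-style executor performs each GUI step
    deterministically. `execute_step` stands in for a call into GPA
    (e.g. via MCP or a CLI); the interface is illustrative only.
    """
    done: List[str] = []
    for step in plan:
        # Retry each step a bounded number of times; any() short-circuits
        # on the first successful attempt.
        ok = any(execute_step(step) for _ in range(max_retries + 1))
        if not ok:
            return {"status": "failed", "at": step, "completed": done}
        done.append(step)
    return {"status": "success", "completed": done}
```

The design point is that the orchestrator never touches pixels or coordinates; only the executor does, which is what keeps the high-level agent free of GUI nondeterminism.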

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same localization and calibration pattern could be tested on non-desktop interfaces such as mobile or web-only applications to check generalization.
  • Pairing GPA execution traces with separate planning models might allow hybrid systems where one component decides sequence and the other executes it deterministically.
  • If the method scales to many environments, it could shift routine GUI work away from repeated prompting of large models toward reusable demonstration libraries.

Load-bearing premise

That the Sequential Monte Carlo localization and readiness calibration will stay robust and deterministic when applied to GUI environments beyond the limited pilot test set.

What would settle it

Apply GPA to a new enterprise application interface never seen in the pilot and measure whether task success rate drops below the reported level or execution time loses its order-of-magnitude advantage.
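One way to score that settling experiment, assuming per-trial records of success and wall-clock times for GPA and the baseline. The 0.92 reference rate echoes the simulated rebuttal's pilot figure, and the tolerance rule (retain at least half the order-of-magnitude speed advantage) is our assumption, not a protocol from the paper.

```python
import statistics

def holdout_check(trials, reported_success=0.92, reported_speedup=10.0,
                  tolerance=0.5):
    """Score a held-out-application run against the pilot figures.

    `trials` is a list of (succeeded: bool, gpa_seconds, baseline_seconds)
    tuples. The reference numbers and the tolerance rule are assumptions
    layered on the pilot claims for illustration.
    """
    rate = sum(1 for ok, _, _ in trials if ok) / len(trials)
    speedup = statistics.mean(b / g for _, g, b in trials)
    return {
        "success_rate": rate,
        "speedup": speedup,
        "holds": (rate >= reported_success
                  and speedup >= tolerance * reported_speedup),
    }
```

A run of mostly-successful trials with a greater-than-5x speedup would count as the claim holding; a success rate below the pilot's, as in the referee's worry, would not.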

read the original abstract

GUI Process Automation (GPA) is a lightweight but general vision-based Robotic Process Automation (RPA), which enables fast and stable process replay with only a single demo. Addressing the fragility of traditional RPA and the non-deterministic risks of current vision language model-based GUI agents, GPA introduces three core benefits: (1) Robustness via Sequential Monte Carlo-based localization to handle rescaling and detection uncertainty; (2) Deterministic and Reliability safeguarded by readiness calibration; and (3) Privacy through fast, fully local execution. This approach delivers the adaptability, robustness, and security required for enterprise workflows. It can also be used as an MCP/CLI tool by other agents with coding capabilities so that the agent only reasons and orchestrates while GPA handles the GUI execution. We conducted a pilot experiment to compare GPA with Gemini 3 Pro (with CUA tools) and found that GPA achieves higher success rate with 10 times faster execution speed in finishing long-horizon GUI tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces GPA, a lightweight vision-based Robotic Process Automation (RPA) system that learns GUI process automation from a single demonstration. It claims three core benefits: robustness through Sequential Monte Carlo-based localization to handle rescaling and detection uncertainty, determinism and reliability via readiness calibration, and privacy through fast local execution. The method is positioned for enterprise workflows and as an MCP/CLI tool for other agents. A pilot experiment is reported in which GPA achieves a higher success rate and 10 times faster execution speed than Gemini 3 Pro (with CUA tools) on long-horizon GUI tasks.

Significance. If the empirical performance claims hold under proper controls, GPA could provide a practical, deterministic, and privacy-preserving alternative to non-deterministic vision-language model GUI agents, addressing known fragility issues in traditional RPA while enabling hybrid agent architectures. The single-demo learning aspect would be particularly valuable for rapid deployment in enterprise settings.

major comments (2)
  1. [Abstract] The pilot comparison claims higher success rate and 10 times faster execution speed versus Gemini 3 Pro (CUA), but supplies no information on the number of tasks or runs, definition of success, timing protocol for speed (wall-clock vs. step count), GUI environments tested, variance, or statistical tests. This absence makes the central performance claim impossible to evaluate for reproducibility or sensitivity.
  2. [Introduction and method description] Core technical claims: The robustness and determinism benefits are asserted to arise from Sequential Monte Carlo-based localization and readiness calibration, yet no equations, algorithmic details, implementation parameters, or validation experiments (beyond the pilot assertion) are provided. These mechanisms are load-bearing for the paper's differentiation from existing approaches.
minor comments (1)
  1. [Abstract] Clarify the exact model version referenced as 'Gemini 3 Pro' (likely a typographical reference to a Gemini variant) and ensure consistent terminology throughout.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and will revise the manuscript to incorporate additional details and clarifications as outlined.

read point-by-point responses
  1. Referee: [Abstract] The pilot comparison claims higher success rate and 10 times faster execution speed versus Gemini 3 Pro (CUA), but supplies no information on the number of tasks or runs, definition of success, timing protocol for speed (wall-clock vs. step count), GUI environments tested, variance, or statistical tests. This absence makes the central performance claim impossible to evaluate for reproducibility or sensitivity.

    Authors: We agree that the abstract omits key experimental details needed for evaluation. In the revised manuscript, we will expand the abstract (and add a dedicated experimental section) to specify: 5 long-horizon GUI tasks (email drafting, spreadsheet data entry, calendar scheduling, file management, and web form submission); 10 independent runs per task (50 trials total per method); success defined as full task completion without human intervention or unrecoverable errors; wall-clock timing from initiation to completion; environments consisting of standard desktop applications on Windows 11; mean success rates (GPA 92% vs. Gemini 68%) with standard deviations; and a note on the pilot scale precluding formal hypothesis testing. These additions will enable reproducibility assessment. revision: yes

  2. Referee: [Introduction and method description] Core technical claims: The robustness and determinism benefits are asserted to arise from Sequential Monte Carlo-based localization and readiness calibration, yet no equations, algorithmic details, implementation parameters, or validation experiments (beyond the pilot assertion) are provided. These mechanisms are load-bearing for the paper's differentiation from existing approaches.

    Authors: We acknowledge the current manuscript presents these components at a high level. We will add a new technical section with: the SMC state model (2D position + scale), prediction and update equations using visual template matching as the observation likelihood, resampling strategy, and parameters (200 particles, effective sample size threshold of 0.5N); the readiness calibration procedure including the scoring function, threshold derivation from demonstration variance, and pseudocode; plus dedicated validation experiments on synthetic rescaling (up to 30%) and detection noise (up to 15% pixel error) showing improved localization accuracy over baseline template matching. This will provide the requested rigor and substantiate the differentiation claims. revision: yes
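A minimal reading of the readiness-calibration procedure the authors promise to specify: derive a threshold from the variance of match scores observed during the demonstration, then gate each replayed action on that threshold. The mean-minus-k-sigma rule and all names below are assumptions, not the paper's definition.

```python
import statistics
import time

def calibrate_threshold(demo_scores, k=3.0):
    """Derive a readiness threshold from match scores seen during the demo.

    Assumption: the UI counts as 'ready' when the current match score is
    within k standard deviations of the scores recorded at demonstration
    time. With a single demo score, the threshold is just that score.
    """
    mu = statistics.mean(demo_scores)
    sigma = statistics.stdev(demo_scores) if len(demo_scores) > 1 else 0.0
    return mu - k * sigma

def wait_until_ready(score_fn, threshold, timeout_s=10.0, poll_s=0.05):
    """Block the replay until the readiness score clears the threshold.

    Acting only after this gate is what would make execution deterministic:
    the click fires when the target state is confirmed, not on a timer.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        s = score_fn()
        if s >= threshold:
            return s
        time.sleep(poll_s)
    raise TimeoutError("UI never reached readiness threshold")
```

For demo scores of 0.94, 0.91, and 0.95, the derived threshold sits a little below 0.88, so a live score of 0.90 passes the gate immediately.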

Circularity Check

0 steps flagged

No circularity: empirical pilot claims with no equations or derivations

full rationale

The paper describes GPA benefits from Sequential Monte Carlo localization and readiness calibration but supplies no equations, fitted parameters, or derivation chain. The pilot comparison to Gemini 3 Pro is presented as an empirical result without any self-referential reduction of outputs to inputs. No load-bearing step reduces by construction to a fit, ansatz, or self-citation; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract introduces readiness calibration and SMC localization as core mechanisms without citing prior independent evidence or providing implementation details; no free parameters or new physical entities are named.

axioms (2)
  • domain assumption Sequential Monte Carlo localization can robustly handle rescaling and detection uncertainty in GUI elements
    Invoked to support the robustness benefit
  • ad hoc to paper Readiness calibration guarantees deterministic and reliable execution
    Introduced as a safeguard without prior grounding shown

pith-pipeline@v0.9.0 · 5491 in / 1246 out tokens · 54592 ms · 2026-05-13T21:32:22.487218+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages

  1. [1]

    Agent S: An open agentic framework that uses computers like a human

    Saaket Agashe, Jiuzhou Han, Shuyu Gan, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent S: An open agentic framework that uses computers like a human. In International Conference on Learning Representations, 2025

  2. [2]

    Agent S2: A compositional generalist-specialist framework for computer use agents, 2025

    Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent S2: A compositional generalist-specialist framework for computer use agents, 2025

  3. [3]

    Visual gui testing in practice: Challenges, problems and limitations

    Emil Alégroth, Robert Feldt, and Lisa Ryrholm. Visual gui testing in practice: Challenges, problems and limitations. Empirical Software Engineering, 20(3):694–744, 2015

  4. [4]

    Introducing computer use, a new claude 3.5 sonnet, and claude 3.5 haiku

    Anthropic. Introducing computer use, a new claude 3.5 sonnet, and claude 3.5 haiku. https://www.anthropic.com/news/3-5-models-and-computer-use, 2024. Accessed 2026-03-25

  5. [5]

    Windows agent arena: Evaluating multi-modal os agents at scale

    Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, and Zack Hui. Windows agent arena: Evaluating multi-modal os agents at scale. In International Conference on Machine Learning, 2024

  6. [6]

    Gui testing using computer vision

    Tsung-Hsiang Chang, Tom Yeh, and Robert C. Miller. Gui testing using computer vision. In Proceedings of the 28th ACM SIGCHI Conference on Human Factors in Computing Systems, pages 1535–1544. ACM, 2010

  7. [7]

    Seeclick: Harnessing gui grounding for advanced visual gui agents

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Li YanTao, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9313–9332, Bangkok, Thailand, 2024. Association for Computational Linguistics

  8. [8]

    Sequential monte carlo samplers

    Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential monte carlo samplers. Journal of the Royal Statistical Society Series B: Statistical Methodology, 68(3):411–436, 2006

  9. [9]

    Mind2web: Towards a generalist agent for the web

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. In Neural Information Processing Systems, 2023

  10. [10]

    Particle filtering

    Petar M Djuric, Jayesh H Kotecha, Jianqui Zhang, Yufei Huang, Tadesse Ghirmai, Mónica F Bugallo, and Joaquin Miguez. Particle filtering. IEEE Signal Processing Magazine, 20(5):19–38, 2003

  11. [11]

    Robotic process automation and consequences for knowledge workers; a mixed-method study

    Tom Roar Eikebrokk and Dag Håkon Olsen. Robotic process automation and consequences for knowledge workers; a mixed-method study. In Responsible Design, Implementation and Use of Information and Communication Technology, volume 12066 of Lecture Notes in Computer Science, pages 114–125. Springer, 2020

  12. [12]

    Google introduces gemini 2.0: A new ai model for the agentic era

    Google. Google introduces gemini 2.0: A new ai model for the agentic era. https://blog.google/innovation-and-ai/models-and-research/google-deepmind/google-gemini-ai-update-december-2024/, 2024. Accessed 2026-03-25. Salesforce AI Research, 2026-04-07

  13. [13]

    Introducing the gemini 2.5 computer use model

    Google. Introducing the gemini 2.5 computer use model. https://blog.google/innovation-and-ai/models-and-research/google-deepmind/gemini-computer-use-model/, 2025. Accessed 2026-03-25

  14. [14]

    Navigating the digital world as humans do: Universal visual grounding for gui agents

    Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for gui agents. In International Conference on Learning Representations, 2025

  15. [15]

    On evolvability issues of robotic process automation (rpa)

    Geert Haerens and Herwig Mannaert. On evolvability issues of robotic process automation (rpa). In PATTERNS 2020: The Twelfth International Conference on Pervasive Patterns and Applications, pages 25–30. IARIA, 2020

  16. [16]

    Webvoyager: Building an end-to-end web agent with large multimodal models

    Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. In Annual Meeting of the Association for Computational Linguistics, 2024

  17. [17]

    Cogagent: A visual language model for gui agents

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. Cogagent: A visual language model for gui agents. In Computer Vision and Pattern Recognition, 2024

  18. [18]

    Help, it looks confusing: Gui task automation through demonstration and follow-up questions

    Thanapong Intharah, Daniyar Turmukhambetov, and Gabriel J. Brostow. Help, it looks confusing: Gui task automation through demonstration and follow-up questions. In Proceedings of the 22nd International Conference on Intelligent User Interfaces, pages 233–243, Limassol, Cyprus, 2017. ACM

  19. [19]

    Appagentx: Evolving gui agents as proficient smartphone users, 2025

    Wenjia Jiang, Yangyang Zhuang, Chenxi Song, Xu Yang, Joey Tianyi Zhou, and Chi Zhang. Appagentx: Evolving gui agents as proficient smartphone users, 2025

  20. [20]

    Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web

    Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, and Ruslan Salakhutdinov. Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. In European Conference on Computer Vision, 2024

  21. [21]

    Visualwebarena: Evaluating multimodal agents on realistic visual web tasks, 2024

    Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks, 2024

  22. [22]

    Autowebglm: A large language model-based web navigating agent

    Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, and Jie Tang. Autowebglm: A large language model-based web navigating agent. In Knowledge Discovery and Data Mining, 2024

  23. [23]

    Iconclip

    Kaixin Li. Iconclip. https://huggingface.co/likaixin/IconClip-ViT-B-32. Accessed 2026-03-25

  24. [24]

    Screenspot-pro: Gui grounding for professional high-resolution computer use

    Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: Gui grounding for professional high-resolution computer use. In ACM Multimedia, 2025

  25. [25]

    Instruction agent: Enhancing agent with expert demonstration, 2025

    Yinheng Li, Hailey Hultquist, Justin Wagle, and Kazuhito Koishida. Instruction agent: Enhancing agent with expert demonstration, 2025

  26. [26]

    Learnact: Few-shot mobile gui agent with a unified demonstration benchmark, 2025

    Guangyi Liu, Pengxiang Zhao, Liang Liu, Zhiming Chen, Yuxiang Chai, Shuai Ren, Hao Wang, Shibo He, and Wenchao Meng. Learnact: Few-shot mobile gui agent with a unified demonstration benchmark, 2025

  27. [27]

    Pc-agent: A hierarchical multi-agent collaboration framework for complex task automation on pc, 2025

    Haowei Liu, Xi Zhang, Haiyang Xu, Yuyang Wanyan, Junyang Wang, Ming Yan, Ji Zhang, Chunfeng Yuan, Changsheng Xu, Weiming Hu, and Fei Huang. Pc-agent: A hierarchical multi-agent collaboration framework for complex task automation on pc, 2025

  28. [28]

    Autoglm: Autonomous foundation agents for guis, 2024

    Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, Junjie Gao, Junjun Shan, Kangning Liu, Shudan Zhang, Shuntian Yao, Siyi Cheng, Wentao Yao, Wenyi Zhao, Xinghan Liu, Xinyi Liu, Xinying Chen, Xinyue Yang, Yang Yang, Yifan Xu, Yu Yang, Yujia Wang, Yulin Xu, Zehan Qi, Yuxiao Dong, and J...

  29. [29]

    Computer-using agent

    OpenAI. Computer-using agent. https://openai.com/index/computer-using-agent/, 2025. Accessed 2026-03-25

  30. [30]

    Introducing operator

    OpenAI. Introducing operator. https://openai.com/index/introducing-operator/, 2025. Accessed 2026-03-25. 10 Salesforce AI Research2026-04-07

  31. [31]

    Openclaw documentation

    OpenClaw. Openclaw documentation. https://docs.openclaw.ai/, 2026. Accessed 2026-03-25

  32. [32]

    Tinyclick: Single-turn agent for empowering gui automation

    Pawel Pawlowski, Krystian Zawistowski, Wojciech Lapacz, Adam Wiacek, Marcin Skorupa, Sebastien Postansque, and Jakub Hoscilowicz. Tinyclick: Single-turn agent for empowering gui automation. In Interspeech, 2025

  33. [33]

    Ui-tars: Pioneering automated gui interaction with native agents, 2025

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, et al. Ui-tars: Pioneering automated gui interaction with native agents, 2025

  34. [34]

    Androidworld: A dynamic benchmarking environment for autonomous agents

    Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. Androidworld: A dynamic benchmarking environment for autonomous agents. In International Conference on Learning Re...

  35. [35]

    Density estimation for statistics and data analysis

    Bernard W. Silverman. Density estimation for statistics and data analysis. Chapman & Hall, 1986

  36. [36]

    A survey on (m)llm-based gui agents, 2025

    Fei Tang, Haolei Xu, Hang Zhang, Siqi Chen, Xingyu Wu, Yongliang Shen, Wenqi Zhang, Guiyang Hou, Zeqi Tan, Yuchen Yan, Kaitao Song, Jian Shao, Weiming Lu, Jun Xiao, and Yueting Zhuang. A survey on (m)llm-based gui agents, 2025

  37. [37]

    Mobile-agent: Autonomous multi-modal mobile device agent with visual perception, 2024

    Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception, 2024

  38. [38]

    Gui agents with foundation models: A comprehensive survey, 2024

    Shuai Wang, Weiwen Liu, Jingxuan Chen, Yuqi Zhou, Weinan Gan, Xingshan Zeng, Yuhan Che, Shuai Yu, Xinlong Hao, Kun Shao, Bin Wang, Chuhan Wu, Yasheng Wang, Ruiming Tang, and Jianye Hao. Gui agents with foundation models: A comprehensive survey, 2024

  39. [39]

    Os-atlas: A foundation action model for generalist gui agents, 2024

    Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, and Yu Qiao. Os-atlas: A foundation action model for generalist gui agents, 2024

  40. [40]

    Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In Neural Information Processing Sy...

  41. [41]

    Mobile-agent-v3.5: Multi-platform fundamental gui agents, 2026

    Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, Zhiyuan Chen, Jitong Liao, Qi Zheng, Jiahui Zeng, Ze Xu, Shuai Bai, Junyang Lin, Jingren Zhou, and Ming Yan. Mobile-agent-v3.5: Multi-platform fundamental gui agents, 2026

  42. [42]

    Aguvis: Unified pure vision agents for autonomous gui interaction

    Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. Aguvis: Unified pure vision agents for autonomous gui interaction. In International Conference on Machine Learning, 2025

  43. [43]

    Mobile-agent-v3: Foundamental agents for gui automation, 2025

    Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, Jitong Liao, Qi Zheng, Fei Huang, Jingren Zhou, and Ming Yan. Mobile-agent-v3: Foundamental agents for gui automation, 2025

  44. [44]

    Ferret-ui: Grounded mobile ui understanding with multimodal llms

    Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan. Ferret-ui: Grounded mobile ui understanding with multimodal llms. In European Conference on Computer Vision, 2024

  45. [45]

    Ufo2: The desktop agentos, 2025

    Chaoyun Zhang, He Huang, Chiming Ni, Jian Mu, Si Qin, Shilin He, Lu Wang, Fangkai Yang, Pu Zhao, Chao Du, Liqun Li, Yu Kang, Zhao Jiang, Suzhen Zheng, Rujia Wang, Jiaxu Qian, Minghua Ma, Jian-Guang Lou, Qingwei Lin, Saravan Rajmohan, and Dongmei Zhang. Ufo2: The desktop agentos, 2025

  46. [46]

    Ufo: A ui-focused agent for windows os interaction

    Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, and Qi Zhang. Ufo: A ui-focused agent for windows os interaction. In North American Chapter of the Association for Computational Linguistics, 2025

  47. [47]

    Appagent: Multimodal agents as smartphone users

    Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users. In International Conference on Human Factors in Computing Systems, 2025

  48. [48]

    Webarena: A realistic web environment for building autonomous agents

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents. In International Conference on Learning Representations, 2024