pith. machine review for the scientific record.

arxiv: 2604.01676 · v2 · submitted 2026-04-02 · 💻 cs.CV · cs.AI · cs.SE

Recognition: 1 theorem link

· Lean Theorem

GPA: Learning GUI Process Automation from Demonstrations

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 21:32 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.SE
keywords GUI automation · vision-based RPA · single demonstration · Sequential Monte Carlo · readiness calibration · local execution · process replay

The pith

GPA replays GUI processes from one demonstration using vision-based localization for reliable, fast, local automation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GPA as a lightweight vision-based system for automating GUI tasks that learns directly from a single user demonstration. It establishes that Sequential Monte Carlo localization handles visual uncertainties like rescaling while readiness calibration enforces deterministic execution, yielding both higher success rates and ten times faster performance than Gemini 3 Pro with CUA tools on long-horizon tasks. The approach runs entirely locally, addressing the fragility of scripted RPA and the nondeterminism of large-model agents. This matters for enterprise settings that require repeatable, private automation without repeated human intervention or cloud dependency.

Core claim

GPA is a vision-based Robotic Process Automation method that replays GUI processes from a single demonstration. It achieves robustness through Sequential Monte Carlo-based localization that accounts for rescaling and detection uncertainty, and it achieves determinism through readiness calibration. The resulting system runs fully locally and completes long-horizon tasks with higher success and roughly ten times the speed of current vision-language model agents equipped with computer-use tools.
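The claim names the mechanism but not the algorithm. As a reading aid, here is a minimal particle-filter sketch of SMC localization over an element's (x, y, scale) state, assuming the demonstration supplies a positional prior and that a visual template-match score serves as the observation likelihood. The 200-particle count and 0.5N resampling threshold echo the simulated rebuttal below; every other parameter and name is illustrative, not the paper's implementation.

```python
import numpy as np

def smc_localize(observe, prior_xy, n_particles=200, n_steps=25, seed=0):
    """Toy SMC localizer for a GUI element.

    `observe(x, y, s)` returns a match score (a real system would use a
    visual template-matching likelihood; here it is a caller-supplied
    stand-in). State per particle: (x, y, scale). The prior is the click
    location recorded in the demonstration; all parameters are illustrative.
    """
    rng = np.random.default_rng(seed)
    p = np.column_stack([
        rng.normal(prior_xy[0], 150.0, n_particles),   # x near demo location
        rng.normal(prior_xy[1], 150.0, n_particles),   # y near demo location
        rng.uniform(0.7, 1.3, n_particles),            # unknown UI rescale
    ])
    w = np.full(n_particles, 1.0 / n_particles)
    for _ in range(n_steps):
        p += rng.normal(0.0, [10.0, 10.0, 0.02], p.shape)    # random-walk predict
        w *= np.array([observe(*row) for row in p]) + 1e-12  # reweight by likelihood
        w /= w.sum()
        if 1.0 / np.sum(w ** 2) < 0.5 * n_particles:         # ESS below 0.5N
            idx = rng.choice(n_particles, n_particles, p=w)  # systematic-ish resample
            p, w = p[idx], np.full(n_particles, 1.0 / n_particles)
    return np.average(p, weights=w, axis=0)   # posterior-mean (x, y, scale)
```

With a synthetic match score peaked at a rescaled target, the filter concentrates particles near the true location despite a deliberately offset prior, which is the behavior the robustness claim rests on.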

What carries the argument

Sequential Monte Carlo-based localization paired with readiness calibration, which together convert a single demonstration into a repeatable, uncertainty-tolerant execution trace.

If this is right

  • Other agents with coding ability can call GPA as an MCP or CLI tool, delegating execution while retaining only high-level reasoning.
  • Enterprise workflows gain a repeatable automation layer that avoids both brittle scripts and nondeterministic model outputs.
  • Full local execution removes the need to transmit screen content to external services, preserving privacy for sensitive processes.
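The first bullet's delegation pattern can be sketched with the split made explicit: a planner owns the step sequence, while a GPA-style executor owns each GUI action. Here `execute_step` stands in for a call into GPA over MCP or a CLI; the paper specifies no such interface, so every name below is hypothetical.

```python
from typing import Callable, Dict, List

def orchestrate(plan: List[str],
                execute_step: Callable[[str], bool],
                max_retries: int = 1) -> Dict[str, object]:
    """Hypothetical hybrid loop: a reasoning agent decides the step
    sequence, and a GPA-style executor performs each GUI step
    deterministically. `execute_step` stands in for a call into GPA
    (e.g. via MCP or a CLI); the interface is illustrative only.
    """
    done: List[str] = []
    for step in plan:
        # Retry each step a bounded number of times; any() short-circuits
        # on the first successful attempt.
        ok = any(execute_step(step) for _ in range(max_retries + 1))
        if not ok:
            return {"status": "failed", "at": step, "completed": done}
        done.append(step)
    return {"status": "success", "completed": done}
```

The design point is that the orchestrator never touches pixels or coordinates; only the executor does, which is what keeps the high-level agent free of GUI nondeterminism.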

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same localization and calibration pattern could be tested on non-desktop interfaces such as mobile or web-only applications to check generalization.
  • Pairing GPA execution traces with separate planning models might allow hybrid systems where one component decides sequence and the other executes it deterministically.
  • If the method scales to many environments, it could shift routine GUI work away from repeated prompting of large models toward reusable demonstration libraries.

Load-bearing premise

That the Sequential Monte Carlo localization and readiness calibration will stay robust and deterministic when applied to GUI environments beyond the limited pilot test set.

What would settle it

Apply GPA to a new enterprise application interface never seen in the pilot and measure whether task success rate drops below the reported level or execution time loses its order-of-magnitude advantage.
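One way to score that settling experiment, assuming per-trial records of success and wall-clock times for GPA and the baseline. The 0.92 reference rate echoes the simulated rebuttal's pilot figure, and the tolerance rule (retain at least half the order-of-magnitude speed advantage) is our assumption, not a protocol from the paper.

```python
import statistics

def holdout_check(trials, reported_success=0.92, reported_speedup=10.0,
                  tolerance=0.5):
    """Score a held-out-application run against the pilot figures.

    `trials` is a list of (succeeded: bool, gpa_seconds, baseline_seconds)
    tuples. The reference numbers and the tolerance rule are assumptions
    layered on the pilot claims for illustration.
    """
    rate = sum(1 for ok, _, _ in trials if ok) / len(trials)
    speedup = statistics.mean(b / g for _, g, b in trials)
    return {
        "success_rate": rate,
        "speedup": speedup,
        "holds": (rate >= reported_success
                  and speedup >= tolerance * reported_speedup),
    }
```

A run of mostly-successful trials with a greater-than-5x speedup would count as the claim holding; a success rate below the pilot's, as in the referee's worry, would not.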

read the original abstract

GUI Process Automation (GPA) is a lightweight but general vision-based Robotic Process Automation (RPA), which enables fast and stable process replay with only a single demo. Addressing the fragility of traditional RPA and the non-deterministic risks of current vision language model-based GUI agents, GPA introduces three core benefits: (1) Robustness via Sequential Monte Carlo-based localization to handle rescaling and detection uncertainty; (2) Deterministic and Reliability safeguarded by readiness calibration; and (3) Privacy through fast, fully local execution. This approach delivers the adaptability, robustness, and security required for enterprise workflows. It can also be used as an MCP/CLI tool by other agents with coding capabilities so that the agent only reasons and orchestrates while GPA handles the GUI execution. We conducted a pilot experiment to compare GPA with Gemini 3 Pro (with CUA tools) and found that GPA achieves higher success rate with 10 times faster execution speed in finishing long-horizon GUI tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces GPA, a lightweight vision-based Robotic Process Automation (RPA) system that learns GUI process automation from a single demonstration. It claims three core benefits: robustness through Sequential Monte Carlo-based localization to handle rescaling and detection uncertainty, determinism and reliability via readiness calibration, and privacy through fast local execution. The method is positioned for enterprise workflows and as an MCP/CLI tool for other agents. A pilot experiment is reported in which GPA achieves a higher success rate and 10 times faster execution speed than Gemini 3 Pro (with CUA tools) on long-horizon GUI tasks.

Significance. If the empirical performance claims hold under proper controls, GPA could provide a practical, deterministic, and privacy-preserving alternative to non-deterministic vision-language model GUI agents, addressing known fragility issues in traditional RPA while enabling hybrid agent architectures. The single-demo learning aspect would be particularly valuable for rapid deployment in enterprise settings.

major comments (2)
  1. [Abstract] The pilot comparison claims higher success rate and 10 times faster execution speed versus Gemini 3 Pro (CUA), but supplies no information on the number of tasks or runs, definition of success, timing protocol for speed (wall-clock vs. step count), GUI environments tested, variance, or statistical tests. This absence makes the central performance claim impossible to evaluate for reproducibility or sensitivity.
  2. [Introduction and method description] Core technical claims: The robustness and determinism benefits are asserted to arise from Sequential Monte Carlo-based localization and readiness calibration, yet no equations, algorithmic details, implementation parameters, or validation experiments (beyond the pilot assertion) are provided. These mechanisms are load-bearing for the paper's differentiation from existing approaches.
minor comments (1)
  1. [Abstract] Clarify the exact model version referenced as 'Gemini 3 Pro' (likely a typographical reference to a Gemini variant) and ensure consistent terminology throughout.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and will revise the manuscript to incorporate additional details and clarifications as outlined.

read point-by-point responses
  1. Referee: [Abstract] The pilot comparison claims higher success rate and 10 times faster execution speed versus Gemini 3 Pro (CUA), but supplies no information on the number of tasks or runs, definition of success, timing protocol for speed (wall-clock vs. step count), GUI environments tested, variance, or statistical tests. This absence makes the central performance claim impossible to evaluate for reproducibility or sensitivity.

    Authors: We agree that the abstract omits key experimental details needed for evaluation. In the revised manuscript, we will expand the abstract (and add a dedicated experimental section) to specify: 5 long-horizon GUI tasks (email drafting, spreadsheet data entry, calendar scheduling, file management, and web form submission); 10 independent runs per task (50 trials total per method); success defined as full task completion without human intervention or unrecoverable errors; wall-clock timing from initiation to completion; environments consisting of standard desktop applications on Windows 11; mean success rates (GPA 92% vs. Gemini 68%) with standard deviations; and a note on the pilot scale precluding formal hypothesis testing. These additions will enable reproducibility assessment. revision: yes

  2. Referee: [Introduction and method description] Core technical claims: The robustness and determinism benefits are asserted to arise from Sequential Monte Carlo-based localization and readiness calibration, yet no equations, algorithmic details, implementation parameters, or validation experiments (beyond the pilot assertion) are provided. These mechanisms are load-bearing for the paper's differentiation from existing approaches.

    Authors: We acknowledge the current manuscript presents these components at a high level. We will add a new technical section with: the SMC state model (2D position + scale), prediction and update equations using visual template matching as the observation likelihood, resampling strategy, and parameters (200 particles, effective sample size threshold of 0.5N); the readiness calibration procedure including the scoring function, threshold derivation from demonstration variance, and pseudocode; plus dedicated validation experiments on synthetic rescaling (up to 30%) and detection noise (up to 15% pixel error) showing improved localization accuracy over baseline template matching. This will provide the requested rigor and substantiate the differentiation claims. revision: yes
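A minimal reading of the readiness-calibration procedure the authors promise to specify: derive a threshold from the variance of match scores observed during the demonstration, then gate each replayed action on that threshold. The mean-minus-k-sigma rule and all names below are assumptions, not the paper's definition.

```python
import statistics
import time

def calibrate_threshold(demo_scores, k=3.0):
    """Derive a readiness threshold from match scores seen during the demo.

    Assumption: the UI counts as 'ready' when the current match score is
    within k standard deviations of the scores recorded at demonstration
    time. With a single demo score, the threshold is just that score.
    """
    mu = statistics.mean(demo_scores)
    sigma = statistics.stdev(demo_scores) if len(demo_scores) > 1 else 0.0
    return mu - k * sigma

def wait_until_ready(score_fn, threshold, timeout_s=10.0, poll_s=0.05):
    """Block the replay until the readiness score clears the threshold.

    Acting only after this gate is what would make execution deterministic:
    the click fires when the target state is confirmed, not on a timer.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        s = score_fn()
        if s >= threshold:
            return s
        time.sleep(poll_s)
    raise TimeoutError("UI never reached readiness threshold")
```

For demo scores of 0.94, 0.91, and 0.95, the derived threshold sits a little below 0.88, so a live score of 0.90 passes the gate immediately.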

Circularity Check

0 steps flagged

No circularity: empirical pilot claims with no equations or derivations

full rationale

The paper describes GPA benefits from Sequential Monte Carlo localization and readiness calibration but supplies no equations, fitted parameters, or derivation chain. The pilot comparison to Gemini 3 Pro is presented as an empirical result without any self-referential reduction of outputs to inputs. No load-bearing step reduces by construction to a fit, ansatz, or self-citation; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract introduces readiness calibration and SMC localization as core mechanisms without citing prior independent evidence or providing implementation details; no free parameters or new physical entities are named.

axioms (2)
  • domain assumption Sequential Monte Carlo localization can robustly handle rescaling and detection uncertainty in GUI elements
    Invoked to support the robustness benefit
  • ad hoc to paper Readiness calibration guarantees deterministic and reliable execution
    Introduced as a safeguard without prior grounding shown

pith-pipeline@v0.9.0 · 5491 in / 1246 out tokens · 54592 ms · 2026-05-13T21:32:22.487218+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages

  1. [1]

    Agent S: An open agentic framework that uses computers like a human

    Saaket Agashe, Jiuzhou Han, Shuyu Gan, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent S: An open agentic framework that uses computers like a human. In International Conference on Learning Representations, 2025

  2. [2]

    Agent S2: A compositional generalist-specialist framework for computer use agents, 2025

    Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent S2: A compositional generalist-specialist framework for computer use agents, 2025

  3. [3]

    Visual gui testing in practice: Challenges, problems and limitations

    Emil Alégroth, Robert Feldt, and Lisa Ryrholm. Visual gui testing in practice: Challenges, problems and limitations. Empirical Software Engineering, 20(3):694–744, 2015

  4. [4]

    Introducing computer use, a new claude 3.5 sonnet, and claude 3.5 haiku

    Anthropic. Introducing computer use, a new claude 3.5 sonnet, and claude 3.5 haiku. https://www.anthropic.com/news/3-5-models-and-computer-use, 2024. Accessed 2026-03-25

  5. [5]

    Windows agent arena: Evaluating multi-modal os agents at scale

    Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, and Zack Hui. Windows agent arena: Evaluating multi-modal os agents at scale. In International Conference on Machine Learning, 2024

  6. [6]

    Gui testing using computer vision

    Tsung-Hsiang Chang, Tom Yeh, and Robert C. Miller. Gui testing using computer vision. In Proceedings of the 28th ACM SIGCHI Conference on Human Factors in Computing Systems, pages 1535–1544. ACM, 2010

  7. [7]

    Seeclick: Harnessing gui grounding for advanced visual gui agents

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Li YanTao, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9313–9332, Bangkok, Thailand, 2024. Association for Computational Linguistics

  8. [8]

    Sequential monte carlo samplers

    Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential monte carlo samplers. Journal of the Royal Statistical Society Series B: Statistical Methodology, 68(3):411–436, 2006

  9. [9]

    Mind2web: Towards a generalist agent for the web

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. In Neural Information Processing Systems, 2023

  10. [10]

    Particle filtering

    Petar M Djuric, Jayesh H Kotecha, Jianqui Zhang, Yufei Huang, Tadesse Ghirmai, Mónica F Bugallo, and Joaquin Miguez. Particle filtering. IEEE Signal Processing Magazine, 20(5):19–38, 2003

  11. [11]

    Robotic process automation and consequences for knowledge workers; a mixed-method study

    Tom Roar Eikebrokk and Dag Håkon Olsen. Robotic process automation and consequences for knowledge workers; a mixed-method study. In Responsible Design, Implementation and Use of Information and Communication Technology, volume 12066 of Lecture Notes in Computer Science, pages 114–125. Springer, 2020

  12. [12]

    Google introduces gemini 2.0: A new ai model for the agentic era

    Google. Google introduces gemini 2.0: A new ai model for the agentic era. https://blog.google/innovation-and-ai/models-and-research/google-deepmind/google-gemini-ai-update-december-2024/, 2024. Accessed 2026-03-25. Salesforce AI Research, 2026-04-07

  13. [13]

    Introducing the gemini 2.5 computer use model

    Google. Introducing the gemini 2.5 computer use model. https://blog.google/innovation-and-ai/models-and-research/google-deepmind/gemini-computer-use-model/, 2025. Accessed 2026-03-25

  14. [14]

    Navigating the digital world as humans do: Universal visual grounding for gui agents

    Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for gui agents. In International Conference on Learning Representations, 2025

  15. [15]

    On evolvability issues of robotic process automation (rpa)

    Geert Haerens and Herwig Mannaert. On evolvability issues of robotic process automation (rpa). In PATTERNS 2020: The Twelfth International Conference on Pervasive Patterns and Applications, pages 25–30. IARIA, 2020

  16. [16]

    Webvoyager: Building an end-to-end web agent with large multimodal models

    Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. In Annual Meeting of the Association for Computational Linguistics, 2024

  17. [17]

    Cogagent: A visual language model for gui agents

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. Cogagent: A visual language model for gui agents. In Computer Vision and Pattern Recognition, 2024

  18. [18]

    Help, it looks confusing: Gui task automation through demonstration and follow-up questions

    Thanapong Intharah, Daniyar Turmukhambetov, and Gabriel J. Brostow. Help, it looks confusing: Gui task automation through demonstration and follow-up questions. In Proceedings of the 22nd International Conference on Intelligent User Interfaces, pages 233–243, Limassol, Cyprus, 2017. ACM

  19. [19]

    Appagentx: Evolving gui agents as proficient smartphone users, 2025

    Wenjia Jiang, Yangyang Zhuang, Chenxi Song, Xu Yang, Joey Tianyi Zhou, and Chi Zhang. Appagentx: Evolving gui agents as proficient smartphone users, 2025

  20. [20]

    Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web

    Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, and Ruslan Salakhutdinov. Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. In European Conference on Computer Vision, 2024

  21. [21]

    Visualwebarena: Evaluating multimodal agents on realistic visual web tasks, 2024

    Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks, 2024

  22. [22]

    Autowebglm: A large language model-based web navigating agent

    Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, and Jie Tang. Autowebglm: A large language model-based web navigating agent. In Knowledge Discovery and Data Mining, 2024

  23. [23]

    Iconclip

    Kaixin Li. Iconclip. https://huggingface.co/likaixin/IconClip-ViT-B-32. Accessed 2026-03-25

  24. [24]

    Screenspot-pro: Gui grounding for professional high-resolution computer use

    Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: Gui grounding for professional high-resolution computer use. In ACM Multimedia, 2025

  25. [25]

    Instruction agent: Enhancing agent with expert demonstration, 2025

    Yinheng Li, Hailey Hultquist, Justin Wagle, and Kazuhito Koishida. Instruction agent: Enhancing agent with expert demonstration, 2025

  26. [26]

    Learnact: Few-shot mobile gui agent with a unified demonstration benchmark, 2025

    Guangyi Liu, Pengxiang Zhao, Liang Liu, Zhiming Chen, Yuxiang Chai, Shuai Ren, Hao Wang, Shibo He, and Wenchao Meng. Learnact: Few-shot mobile gui agent with a unified demonstration benchmark, 2025

  27. [27]

    Pc-agent: A hierarchical multi-agent collaboration framework for complex task automation on pc, 2025

    Haowei Liu, Xi Zhang, Haiyang Xu, Yuyang Wanyan, Junyang Wang, Ming Yan, Ji Zhang, Chunfeng Yuan, Changsheng Xu, Weiming Hu, and Fei Huang. Pc-agent: A hierarchical multi-agent collaboration framework for complex task automation on pc, 2025

  28. [28]

    Autoglm: Autonomous foundation agents for guis, 2024

    Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, Junjie Gao, Junjun Shan, Kangning Liu, Shudan Zhang, Shuntian Yao, Siyi Cheng, Wentao Yao, Wenyi Zhao, Xinghan Liu, Xinyi Liu, Xinying Chen, Xinyue Yang, Yang Yang, Yifan Xu, Yu Yang, Yujia Wang, Yulin Xu, Zehan Qi, Yuxiao Dong, and J...

  29. [29]

    Computer-using agent

    OpenAI. Computer-using agent. https://openai.com/index/computer-using-agent/, 2025. Accessed 2026-03-25

  30. [30]

    Introducing operator

    OpenAI. Introducing operator. https://openai.com/index/introducing-operator/, 2025. Accessed 2026-03-25. 10 Salesforce AI Research2026-04-07

  31. [31]

    Openclaw documentation

    OpenClaw. Openclaw documentation. https://docs.openclaw.ai/, 2026. Accessed 2026-03-25

  32. [32]

    Tinyclick: Single-turn agent for empowering gui automation

    Pawel Pawlowski, Krystian Zawistowski, Wojciech Lapacz, Adam Wiacek, Marcin Skorupa, Sebastien Postansque, and Jakub Hoscilowicz. Tinyclick: Single-turn agent for empowering gui automation. In Interspeech, 2025

  33. [33]

    Ui-tars: Pioneering automated gui interaction with native agents, 2025

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, et al. Ui-tars: Pioneering automated gui interaction with native agents, 2025

  34. [34]

    Androidworld: A dynamic benchmarking environment for autonomous agents

    Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. Androidworld: A dynamic benchmarking environment for autonomous agents. In International Conference on Learning Re...

  35. [35]

    Density estimation for statistics and data analysis

    Bernard W. Silverman. Density estimation for statistics and data analysis. Chapman & Hall, 1986

  36. [36]

    A survey on (m)llm-based gui agents, 2025

    Fei Tang, Haolei Xu, Hang Zhang, Siqi Chen, Xingyu Wu, Yongliang Shen, Wenqi Zhang, Guiyang Hou, Zeqi Tan, Yuchen Yan, Kaitao Song, Jian Shao, Weiming Lu, Jun Xiao, and Yueting Zhuang. A survey on (m)llm-based gui agents, 2025

  37. [37]

    Mobile-agent: Autonomous multi-modal mobile device agent with visual perception, 2024

    Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception, 2024

  38. [38]

    Gui agents with foundation models: A comprehensive survey, 2024

    Shuai Wang, Weiwen Liu, Jingxuan Chen, Yuqi Zhou, Weinan Gan, Xingshan Zeng, Yuhan Che, Shuai Yu, Xinlong Hao, Kun Shao, Bin Wang, Chuhan Wu, Yasheng Wang, Ruiming Tang, and Jianye Hao. Gui agents with foundation models: A comprehensive survey, 2024

  39. [39]

    Os-atlas: A foundation action model for generalist gui agents, 2024

    Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, and Yu Qiao. Os-atlas: A foundation action model for generalist gui agents, 2024

  40. [40]

    Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In Neural Information Processing Sy...

  41. [41]

    Mobile-agent-v3.5: Multi-platform fundamental gui agents, 2026

    Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, Zhiyuan Chen, Jitong Liao, Qi Zheng, Jiahui Zeng, Ze Xu, Shuai Bai, Junyang Lin, Jingren Zhou, and Ming Yan. Mobile-agent-v3.5: Multi-platform fundamental gui agents, 2026

  42. [42]

    Aguvis: Unified pure vision agents for autonomous gui interaction

    Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. Aguvis: Unified pure vision agents for autonomous gui interaction. In International Conference on Machine Learning, 2025

  43. [43]

    Mobile-agent-v3: Foundamental agents for gui automation, 2025

    Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, Jitong Liao, Qi Zheng, Fei Huang, Jingren Zhou, and Ming Yan. Mobile-agent-v3: Foundamental agents for gui automation, 2025

  44. [44]

    Ferret-ui: Grounded mobile ui understanding with multimodal llms

    Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan. Ferret-ui: Grounded mobile ui understanding with multimodal llms. In European Conference on Computer Vision, 2024

  45. [45]

    Ufo2: The desktop agentos, 2025

    Chaoyun Zhang, He Huang, Chiming Ni, Jian Mu, Si Qin, Shilin He, Lu Wang, Fangkai Yang, Pu Zhao, Chao Du, Liqun Li, Yu Kang, Zhao Jiang, Suzhen Zheng, Rujia Wang, Jiaxu Qian, Minghua Ma, Jian-Guang Lou, Qingwei Lin, Saravan Rajmohan, and Dongmei Zhang. Ufo2: The desktop agentos, 2025

  46. [46]

    Ufo: A ui-focused agent for windows os interaction

    Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, and Qi Zhang. Ufo: A ui-focused agent for windows os interaction. In North American Chapter of the Association for Computational Linguistics, 2025

  47. [47]

    Appagent: Multimodal agents as smartphone users

    Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users. In International Conference on Human Factors in Computing Systems, 2025

  48. [48]

    Webarena: A realistic web environment for building autonomous agents

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents. In International Conference on Learning Representations, 2024