GPA: Learning GUI Process Automation from Demonstrations
Recognition: 1 Lean theorem link
Pith reviewed 2026-05-13 21:32 UTC · model grok-4.3
The pith
GPA replays GUI processes from one demonstration using vision-based localization for reliable, fast, local automation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GPA is a vision-based Robotic Process Automation method that replays GUI processes from a single demonstration. It achieves robustness through Sequential Monte Carlo-based localization that accounts for rescaling and detection uncertainty, and it achieves determinism through readiness calibration. The resulting system runs fully locally and completes long-horizon tasks with higher success and roughly ten times the speed of current vision-language model agents equipped with computer-use tools.
What carries the argument
Sequential Monte Carlo-based localization paired with readiness calibration, which together convert a single demonstration into a repeatable, uncertainty-tolerant execution trace.
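Readiness calibration is described only at the level of "wait until the interface is demonstrably ready before replaying the next action." A minimal sketch of one plausible stability gate, assuming a frame-difference score; the function names, polling scheme, and threshold rule are illustrative and not taken from the paper:

```python
import time
import numpy as np

def frame_score(prev: np.ndarray, curr: np.ndarray) -> float:
    """Mean absolute pixel difference between consecutive screenshots."""
    return float(np.mean(np.abs(curr.astype(np.float32) - prev.astype(np.float32))))

def wait_until_ready(grab_frame, threshold: float,
                     timeout_s: float = 10.0, poll_s: float = 0.1) -> bool:
    """Block until the screen stops changing (score below threshold),
    signalling that the GUI is ready for the next replayed action."""
    prev = grab_frame()
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        time.sleep(poll_s)
        curr = grab_frame()
        if frame_score(prev, curr) < threshold:
            return True   # screen is stable: safe to act
        prev = curr
    return False          # never settled: abort rather than act blindly
```

The rebuttal below suggests the threshold is derived from score variance observed during the demonstration; the timeout path matters because aborting, rather than acting on an unsettled screen, is what preserves determinism.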
If this is right
- Other agents with coding ability can call GPA as an MCP or CLI tool, delegating execution while retaining only high-level reasoning (a hypothetical delegation pattern is sketched after this list).
- Enterprise workflows gain a repeatable automation layer that avoids both brittle scripts and nondeterministic model outputs.
- Full local execution removes the need to transmit screen content to external services, preserving privacy for sensitive processes.
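To make the first bullet concrete: the planning agent decides only which recorded process to run, then shells out. The paper does not document GPA's actual CLI or MCP surface, so the `gpa replay` entry point below is purely hypothetical:

```python
import subprocess

def run_gpa_process(demo_path: str, timeout_s: int = 300) -> bool:
    """Delegate GUI execution to a (hypothetical) local GPA command line.
    The planning agent reasons about *which* recorded demo to replay;
    GPA handles localization, readiness checks, and input events locally."""
    try:
        result = subprocess.run(
            ["gpa", "replay", demo_path],  # hypothetical binary and subcommand
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False                       # treat a hung replay as failure
    return result.returncode == 0          # deterministic pass/fail for the agent
```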
Where Pith is reading between the lines
- The same localization and calibration pattern could be tested on non-desktop interfaces such as mobile or web-only applications to check generalization.
- Pairing GPA execution traces with separate planning models might allow hybrid systems where one component decides sequence and the other executes it deterministically.
- If the method scales to many environments, it could shift routine GUI work away from repeated prompting of large models toward reusable demonstration libraries.
Load-bearing premise
That the Sequential Monte Carlo localization and readiness calibration will stay robust and deterministic when applied to GUI environments beyond the limited pilot test set.
What would settle it
Apply GPA to a new enterprise application interface never seen in the pilot and measure whether task success rate drops below the reported level or execution time loses its order-of-magnitude advantage.
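A minimal measurement harness for that test, assuming only that each trial reports binary success and is timed wall-clock; the `run_trial` callable and the loop structure are illustrative, not the paper's protocol:

```python
import statistics
import time

def evaluate(run_trial, tasks, runs_per_task=10):
    """run_trial(task) -> bool (full completion without human intervention).
    Returns overall success rate plus mean/stdev wall-clock seconds."""
    successes, durations = [], []
    for task in tasks:
        for _ in range(runs_per_task):
            start = time.monotonic()
            successes.append(run_trial(task))
            durations.append(time.monotonic() - start)
    rate = sum(successes) / len(successes)
    return rate, statistics.mean(durations), statistics.stdev(durations)

# The claim survives if, on the unseen application, GPA's success rate stays
# near the pilot level and its mean duration stays roughly 10x below the
# VLM agent's.
```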
Original abstract
GUI Process Automation (GPA) is a lightweight but general vision-based Robotic Process Automation (RPA) method that enables fast and stable process replay from only a single demo. Addressing the fragility of traditional RPA and the non-deterministic risks of current vision-language-model-based GUI agents, GPA introduces three core benefits: (1) robustness via Sequential Monte Carlo-based localization to handle rescaling and detection uncertainty; (2) determinism and reliability safeguarded by readiness calibration; and (3) privacy through fast, fully local execution. This approach delivers the adaptability, robustness, and security required for enterprise workflows. It can also be used as an MCP/CLI tool by other agents with coding capabilities, so that the agent only reasons and orchestrates while GPA handles the GUI execution. We conducted a pilot experiment comparing GPA with Gemini 3 Pro (with CUA tools) and found that GPA achieves a higher success rate with 10 times faster execution in finishing long-horizon GUI tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces GPA, a lightweight vision-based Robotic Process Automation (RPA) system that learns GUI process automation from a single demonstration. It claims three core benefits: robustness through Sequential Monte Carlo-based localization to handle rescaling and detection uncertainty, determinism and reliability via readiness calibration, and privacy through fast local execution. The method is positioned for enterprise workflows and as an MCP/CLI tool for other agents. A pilot experiment is reported in which GPA achieves a higher success rate and 10 times faster execution speed than Gemini 3 Pro (with CUA tools) on long-horizon GUI tasks.
Significance. If the empirical performance claims hold under proper controls, GPA could provide a practical, deterministic, and privacy-preserving alternative to non-deterministic vision-language model GUI agents, addressing known fragility issues in traditional RPA while enabling hybrid agent architectures. The single-demo learning aspect would be particularly valuable for rapid deployment in enterprise settings.
Major comments (2)
- [Abstract] The pilot comparison claims a higher success rate and 10 times faster execution versus Gemini 3 Pro (CUA), but supplies no information on the number of tasks or runs, the definition of success, the timing protocol for speed (wall-clock vs. step count), the GUI environments tested, variance, or statistical tests. This absence makes the central performance claim impossible to evaluate for reproducibility or sensitivity.
- [Introduction and method description] Core technical claims: The robustness and determinism benefits are asserted to arise from Sequential Monte Carlo-based localization and readiness calibration, yet no equations, algorithmic details, implementation parameters, or validation experiments (beyond the pilot assertion) are provided. These mechanisms are load-bearing for the paper's differentiation from existing approaches.
Minor comments (1)
- [Abstract] Clarify the exact model version referenced as 'Gemini 3 Pro' (likely a typographical reference to a Gemini variant) and ensure consistent terminology throughout.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and will revise the manuscript to incorporate additional details and clarifications as outlined.
Point-by-point responses
- Referee: [Abstract] The pilot comparison claims a higher success rate and 10 times faster execution versus Gemini 3 Pro (CUA), but supplies no information on the number of tasks or runs, the definition of success, the timing protocol for speed (wall-clock vs. step count), the GUI environments tested, variance, or statistical tests. This absence makes the central performance claim impossible to evaluate for reproducibility or sensitivity.
Authors: We agree that the abstract omits key experimental details needed for evaluation. In the revised manuscript, we will expand the abstract (and add a dedicated experimental section) to specify: 5 long-horizon GUI tasks (email drafting, spreadsheet data entry, calendar scheduling, file management, and web form submission); 10 independent runs per task (50 trials total per method); success defined as full task completion without human intervention or unrecoverable errors; wall-clock timing from initiation to completion; environments consisting of standard desktop applications on Windows 11; mean success rates (GPA 92% vs. Gemini 68%) with standard deviations; and a note on the pilot scale precluding formal hypothesis testing. These additions will enable reproducibility assessment. revision: yes
- Referee: [Introduction and method description] Core technical claims: The robustness and determinism benefits are asserted to arise from Sequential Monte Carlo-based localization and readiness calibration, yet no equations, algorithmic details, implementation parameters, or validation experiments (beyond the pilot assertion) are provided. These mechanisms are load-bearing for the paper's differentiation from existing approaches.
Authors: We acknowledge the current manuscript presents these components at a high level. We will add a new technical section with: the SMC state model (2D position + scale), prediction and update equations using visual template matching as the observation likelihood, resampling strategy, and parameters (200 particles, effective sample size threshold of 0.5N); the readiness calibration procedure including the scoring function, threshold derivation from demonstration variance, and pseudocode; plus dedicated validation experiments on synthetic rescaling (up to 30%) and detection noise (up to 15% pixel error) showing improved localization accuracy over baseline template matching. This will provide the requested rigor and substantiate the differentiation claims. revision: yes
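The parameters promised above (state = 2D position plus scale, 200 particles, effective sample size threshold of 0.5N, template matching as the observation likelihood) are enough to sketch the filter. Everything below is a generic particle-filter reading of that description, not the authors' implementation; `match_score` stands in for whatever template-matching likelihood the paper uses (e.g. normalized cross-correlation):

```python
import numpy as np

N = 200                   # particle count stated in the rebuttal
ESS_THRESHOLD = 0.5 * N   # resample when effective sample size drops below 0.5N

def effective_sample_size(w: np.ndarray) -> float:
    return 1.0 / np.sum(w ** 2)

def systematic_resample(particles: np.ndarray, w: np.ndarray) -> np.ndarray:
    positions = (np.arange(N) + np.random.uniform()) / N
    return particles[np.searchsorted(np.cumsum(w), positions)]

def smc_localize(match_score, screen, init, n_steps=10,
                 pos_noise=5.0, scale_noise=0.02):
    """Track a GUI element's (x, y, scale) under rescaling and detection noise.
    match_score(screen, x, y, scale) -> likelihood of the template there."""
    particles = np.tile(init, (N, 1)).astype(np.float64)  # rows: (x, y, scale)
    weights = np.full(N, 1.0 / N)
    for _ in range(n_steps):
        # Predict: diffuse position and scale to cover rescaling uncertainty.
        particles[:, :2] += np.random.normal(0, pos_noise, (N, 2))
        particles[:, 2] += np.random.normal(0, scale_noise, N)
        # Update: reweight by the template-matching likelihood.
        lik = np.array([match_score(screen, x, y, s) for x, y, s in particles])
        weights *= np.maximum(lik, 1e-12)
        weights /= weights.sum()
        # Resample when the weights degenerate.
        if effective_sample_size(weights) < ESS_THRESHOLD:
            particles = systematic_resample(particles, weights)
            weights = np.full(N, 1.0 / N)
    return np.average(particles, axis=0, weights=weights)  # posterior mean
```

Whether this matches the authors' filter cannot be judged from the manuscript; that is precisely the referee's point.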
Circularity Check
No circularity: empirical pilot claims with no equations or derivations
Full rationale
The paper describes GPA benefits from Sequential Monte Carlo localization and readiness calibration but supplies no equations, fitted parameters, or derivation chain. The pilot comparison to Gemini 3 Pro is presented as an empirical result without any self-referential reduction of outputs to inputs. No load-bearing step reduces by construction to a fit, ansatz, or self-citation; the work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Sequential Monte Carlo localization can robustly handle rescaling and detection uncertainty in GUI elements.
- Ad hoc to paper: Readiness calibration guarantees deterministic and reliable execution.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tagged: unclear)
  unclear: relation between the paper passage and the cited Recognition theorem.
  Cited passage: "We employ a stable recording and building process with a multi-stage retrieval process culminating in a context-guided Sequential Monte Carlo (SMC) inference procedure... readiness calibration... Finite State Machine"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Saaket Agashe, Jiuzhou Han, Shuyu Gan, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent S: An open agentic framework that uses computers like a human. In International Conference on Learning Representations, 2025.
- [2] Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent S2: A compositional generalist-specialist framework for computer use agents, 2025.
- [3] Emil Alégroth, Robert Feldt, and Lisa Ryrholm. Visual GUI testing in practice: Challenges, problems and limitations. Empirical Software Engineering, 20(3):694–744, 2015.
- [4] Anthropic. Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku. https://www.anthropic.com/news/3-5-models-and-computer-use, 2024. Accessed 2026-03-25.
- [5] Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, and Zack Hui. Windows Agent Arena: Evaluating multi-modal OS agents at scale. In International Conference on Machine Learning, 2024.
- [6] Tsung-Hsiang Chang, Tom Yeh, and Robert C. Miller. GUI testing using computer vision. In Proceedings of the 28th ACM SIGCHI Conference on Human Factors in Computing Systems, pages 1535–1544. ACM, 2010.
- [7] Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Li YanTao, Jianbing Zhang, and Zhiyong Wu. SeeClick: Harnessing GUI grounding for advanced visual GUI agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9313–9332, Bangkok, Thailand, 2024. Association for Computational Linguistics.
- [8] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society Series B: Statistical Methodology, 68(3):411–436, 2006.
- [9] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web. In Neural Information Processing Systems, 2023.
- [10] Petar M. Djuric, Jayesh H. Kotecha, Jianqui Zhang, Yufei Huang, Tadesse Ghirmai, Mónica F. Bugallo, and Joaquin Miguez. Particle filtering. IEEE Signal Processing Magazine, 20(5):19–38, 2003.
- [11] Tom Roar Eikebrokk and Dag Håkon Olsen. Robotic process automation and consequences for knowledge workers; a mixed-method study. In Responsible Design, Implementation and Use of Information and Communication Technology, volume 12066 of Lecture Notes in Computer Science, pages 114–125. Springer, 2020.
- [12] Google. Google introduces Gemini 2.0: A new AI model for the agentic era. https://blog.google/innovation-and-ai/models-and-research/google-deepmind/google-gemini-ai-update-december-2024/, 2024. Accessed 2026-03-25.
- [13] Google. Introducing the Gemini 2.5 computer use model. https://blog.google/innovation-and-ai/models-and-research/google-deepmind/gemini-computer-use-model/, 2025. Accessed 2026-03-25.
- [14] Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents. In International Conference on Learning Representations, 2025.
- [15] Geert Haerens and Herwig Mannaert. On evolvability issues of robotic process automation (RPA). In PATTERNS 2020: The Twelfth International Conference on Pervasive Patterns and Applications, pages 25–30. IARIA, 2020.
- [16] Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. WebVoyager: Building an end-to-end web agent with large multimodal models. In Annual Meeting of the Association for Computational Linguistics, 2024.
- [17] Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. CogAgent: A visual language model for GUI agents. In Computer Vision and Pattern Recognition, 2024.
- [18] Thanapong Intharah, Daniyar Turmukhambetov, and Gabriel J. Brostow. Help, it looks confusing: GUI task automation through demonstration and follow-up questions. In Proceedings of the 22nd International Conference on Intelligent User Interfaces, pages 233–243, Limassol, Cyprus, 2017. ACM.
- [19] Wenjia Jiang, Yangyang Zhuang, Chenxi Song, Xu Yang, Joey Tianyi Zhou, and Chi Zhang. AppAgentX: Evolving GUI agents as proficient smartphone users, 2025.
- [20] Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, and Ruslan Salakhutdinov. OmniACT: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. In European Conference on Computer Vision, 2024.
- [21] Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks, 2024.
- [22] Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, and Jie Tang. AutoWebGLM: A large language model-based web navigating agent. In Knowledge Discovery and Data Mining, 2024.
- [23]
- [24] Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. ScreenSpot-Pro: GUI grounding for professional high-resolution computer use. In ACM Multimedia, 2025.
- [25] Yinheng Li, Hailey Hultquist, Justin Wagle, and Kazuhito Koishida. Instruction Agent: Enhancing agent with expert demonstration, 2025.
- [26] Guangyi Liu, Pengxiang Zhao, Liang Liu, Zhiming Chen, Yuxiang Chai, Shuai Ren, Hao Wang, Shibo He, and Wenchao Meng. LearnAct: Few-shot mobile GUI agent with a unified demonstration benchmark, 2025.
- [27] Haowei Liu, Xi Zhang, Haiyang Xu, Yuyang Wanyan, Junyang Wang, Ming Yan, Ji Zhang, Chunfeng Yuan, Changsheng Xu, Weiming Hu, and Fei Huang. PC-Agent: A hierarchical multi-agent collaboration framework for complex task automation on PC, 2025.
- [28] Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, Junjie Gao, Junjun Shan, Kangning Liu, Shudan Zhang, Shuntian Yao, Siyi Cheng, Wentao Yao, Wenyi Zhao, Xinghan Liu, Xinyi Liu, Xinying Chen, Xinyue Yang, Yang Yang, Yifan Xu, Yu Yang, Yujia Wang, Yulin Xu, Zehan Qi, Yuxiao Dong, and J... AutoGLM: Autonomous foundation agents for GUIs, 2024.
- [29] OpenAI. Computer-using agent. https://openai.com/index/computer-using-agent/, 2025. Accessed 2026-03-25.
- [30] OpenAI. Introducing Operator. https://openai.com/index/introducing-operator/, 2025. Accessed 2026-03-25.
- [31] OpenClaw. OpenClaw documentation. https://docs.openclaw.ai/, 2026. Accessed 2026-03-25.
- [32] Pawel Pawlowski, Krystian Zawistowski, Wojciech Lapacz, Adam Wiacek, Marcin Skorupa, Sebastien Postansque, and Jakub Hoscilowicz. TinyClick: Single-turn agent for empowering GUI automation. In Interspeech, 2025.
- [33] Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, et al. UI-TARS: Pioneering automated GUI interaction with native agents, 2025.
- [34] Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. AndroidWorld: A dynamic benchmarking environment for autonomous agents. In International Conference on Learning Representations, 2025.
- [35] Bernard W. Silverman. Density estimation for statistics and data analysis. Chapman & Hall, 1986.
- [36] Fei Tang, Haolei Xu, Hang Zhang, Siqi Chen, Xingyu Wu, Yongliang Shen, Wenqi Zhang, Guiyang Hou, Zeqi Tan, Yuchen Yan, Kaitao Song, Jian Shao, Weiming Lu, Jun Xiao, and Yueting Zhuang. A survey on (M)LLM-based GUI agents, 2025.
- [37] Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-Agent: Autonomous multi-modal mobile device agent with visual perception, 2024.
- [38] Shuai Wang, Weiwen Liu, Jingxuan Chen, Yuqi Zhou, Weinan Gan, Xingshan Zeng, Yuhan Che, Shuai Yu, Xinlong Hao, Kun Shao, Bin Wang, Chuhan Wu, Yasheng Wang, Ruiming Tang, and Jianye Hao. GUI agents with foundation models: A comprehensive survey, 2024.
- [39] Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, and Yu Qiao. OS-Atlas: A foundation action model for generalist GUI agents, 2024.
- [40] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In Neural Information Processing Systems, 2024.
- [41] Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, Zhiyuan Chen, Jitong Liao, Qi Zheng, Jiahui Zeng, Ze Xu, Shuai Bai, Junyang Lin, Jingren Zhou, and Ming Yan. Mobile-Agent-v3.5: Multi-platform fundamental GUI agents, 2026.
- [42] Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. Aguvis: Unified pure vision agents for autonomous GUI interaction. In International Conference on Machine Learning, 2025.
- [43] Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, Jitong Liao, Qi Zheng, Fei Huang, Jingren Zhou, and Ming Yan. Mobile-Agent-v3: Foundamental agents for GUI automation, 2025.
- [44] Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan. Ferret-UI: Grounded mobile UI understanding with multimodal LLMs. In European Conference on Computer Vision, 2024.
- [45] Chaoyun Zhang, He Huang, Chiming Ni, Jian Mu, Si Qin, Shilin He, Lu Wang, Fangkai Yang, Pu Zhao, Chao Du, Liqun Li, Yu Kang, Zhao Jiang, Suzhen Zheng, Rujia Wang, Jiaxu Qian, Minghua Ma, Jian-Guang Lou, Qingwei Lin, Saravan Rajmohan, and Dongmei Zhang. UFO2: The Desktop AgentOS, 2025.
- [46] Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, and Qi Zhang. UFO: A UI-focused agent for Windows OS interaction. In North American Chapter of the Association for Computational Linguistics, 2025.
- [47] Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. AppAgent: Multimodal agents as smartphone users. In International Conference on Human Factors in Computing Systems, 2025.
- [48] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. In International Conference on Learning Representations, 2024.