pith. sign in

arxiv: 2605.28775 · v1 · pith:NB6L5ZT3new · submitted 2026-05-27 · 💻 cs.LG · cs.AI· cs.CL

Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents

Pith reviewed 2026-06-29 14:12 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL
keywords computer-use agentsdomain specializationweakness identificationerror-aware trainingOSWorld benchmarktrajectory generationsmall language modelsautonomous agents
0
0 comments X

The pith

Small computer-use agents gain 11 points on average when trained on tasks that target their specific domain weaknesses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that naive synthesis of large-scale training data for a target software domain produces only marginal gains for small computer-use agents. Instead, a stronger reference agent can detect where the small agent fails, generate tasks that expose those exact weaknesses, and supply automatic supervision. An error-aware training objective then separates planning mistakes from execution mistakes so updates stay precise. On the OSWorld benchmark this yields 11.6 and 11.1 point average lifts over two 7-8B baselines across eight domains, and beats prior autonomous trajectory methods.

Core claim

LearnWeak is an annotation-free specialization framework that uses a stronger reference agent to identify the student's weaknesses in the target domain, synthesize targeted tasks, and construct supervision automatically. It further introduces an error-aware specialization objective that disentangles planning and execution errors, enabling more behaviorally precise updates than broad uniform supervision. On OSWorld, LearnWeak achieves average gains of 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B, respectively, across eight domains, while also outperforming existing autonomous trajectory generation and training baselines.

What carries the argument

Student-aware dataset generation paired with an error-aware specialization objective that disentangles planning and execution errors.

If this is right

  • Targeted tasks based on identified weaknesses produce substantially larger gains than uniform domain data synthesis.
  • Disentangling planning and execution errors yields more precise behavioral updates than uniform supervision.
  • The full pipeline works across eight OSWorld domains without manual annotations or human feedback.
  • Student-aware data generation and training both outperform prior autonomous trajectory baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same weakness-targeting loop could be applied to other agent domains such as web navigation or tool use without domain-specific redesign.
  • If the reference agent itself has blind spots, the resulting specialized model may still miss entire classes of failures.
  • A single small base agent could be repeatedly specialized for many domains at far lower cost than maintaining separate large experts.

Load-bearing premise

A stronger reference agent can reliably identify the student's weaknesses in the target domain to synthesize effective targeted tasks without introducing bias or missing key failure modes.

What would settle it

Running the same specialization pipeline with a different reference agent or repeated runs that produce inconsistent weakness sets, and observing that the reported performance gains disappear or reverse.

Figures

Figures reproduced from arXiv: 2605.28775 by Kangsan Kim, Suji Kim, Sung Ju Hwang.

Figure 1
Figure 1. Figure 1: Conceptual illustration of LEARNWEAK and performance gains after domain specialization, showing consistent improvements of the small student across target software domains. rather than broad generalization. Recent studies [32, 19, 31, 5] provide empirical evidence supporting the effectiveness of this approach for small CUAs. Domain specialization for CUAs consists of two stages: dataset generation and agen… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of LEARNWEAK framework. LEARNWEAK-GEN iteratively constructs domain data by comparing teacher and student responses, summarizing student weaknesses, and generating new tasks conditioned on weakness reports and representative screenshots. LEARNWEAK-DPO then converts specializes the student with step-wise preference supervision and error-aware optimization. At each step t, the agent receives the cur… view at source ↗
Figure 3
Figure 3. Figure 3: The number of generation iters. 5 Analysis 5.1 Data Generation Pipeline Analysis Weakness-awareness. To verify that our dataset generation captures model-specific weaknesses, we train each target model (πθ) on datasets constructed from weakness reports derived from different source students (π S) , as shown in [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Domain-wise statistics of the generated specialization data for each model. [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Trajectory verification prompt. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Teacher–student weakness summarization prompt. [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Screenshot ranking prompt. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Query-generation prompt with weakness report. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Query-generation prompt without weakness report. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Weakness Report and Synthetic Queries: Example #1 [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Weakness Report and Synthetic Queries: Example #2 [PITH_FULL_IMAGE:figures/full_fig_p023_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Weakness Report and Synthetic Queries: Example #3 [PITH_FULL_IMAGE:figures/full_fig_p024_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Case Study #1 (Domain: Libreoffice Calc) 25 [PITH_FULL_IMAGE:figures/full_fig_p025_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Case Study #2 (Domain: Libreoffice Calc) 26 [PITH_FULL_IMAGE:figures/full_fig_p026_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Case Study #3 (Domain: Libreoffice Impress) 27 [PITH_FULL_IMAGE:figures/full_fig_p027_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Case Study #4 (Domain: Libreoffice Impress) 28 [PITH_FULL_IMAGE:figures/full_fig_p028_16.png] view at source ↗
read the original abstract

Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains expensive. Small open computer-use agents are more practical specialization targets, but they remain substantially weaker and exhibit uneven domain-specific failures. A straightforward remedy is to synthesize large-scale training data for the target domain, yet we find that this naive approach yields only marginal improvements. Building on this observation, we introduce LearnWeak, an annotation-free specialization framework for small computer-use agents that uses a stronger reference agent to identify the student's weaknesses in the target domain, synthesize targeted tasks, and construct supervision automatically. LearnWeak further introduces an error-aware specialization objective that disentangles planning and execution errors, enabling more behaviorally precise updates than broad uniform supervision. On OSWorld, LearnWeak achieves average gains of 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B, respectively, across eight domains. We also validate that our student-aware dataset generation and training approaches outperform existing autonomous trajectory generation and training baselines. Our work highlights the importance of student awareness in both data synthesis and agent training, pointing toward a more principled and efficient path for specializing small computer-use agents in diverse domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces LearnWeak, an annotation-free specialization framework for small computer-use agents. It uses a stronger reference agent to identify the student's domain-specific weaknesses, synthesize targeted tasks, and construct supervision; an error-aware objective then disentangles planning and execution errors for more precise updates. The method is claimed to outperform naive trajectory generation, yielding average gains of 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B across eight OSWorld domains, with additional validation against autonomous baselines.

Significance. If the empirical gains are robust and the reference-agent step is reliable, the work offers a practical route to domain specialization of small open agents without per-domain expert models or manual annotation, emphasizing student awareness in both data synthesis and training.

major comments (2)
  1. [Abstract / Method description] The headline gains (11.6/11.1 pp on OSWorld) depend on the reference agent correctly surfacing the student's failure modes to generate targeted tasks. The manuscript provides no independent verification of this step (e.g., human audit of identified weaknesses, inter-annotator agreement, or ablation on reference-agent quality), leaving the claimed advantage over naive trajectory generation unconfirmed.
  2. [Abstract] The observation that 'naive approach yields only marginal improvements' is central to motivating LearnWeak, yet the paper does not detail the experimental setup, domains, or error analysis used to establish this baseline result, making it difficult to assess whether the student-aware pipeline's gains are incremental or transformative.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below with clarifications from the manuscript and indicate where revisions will strengthen the presentation.

read point-by-point responses
  1. Referee: [Abstract / Method description] The headline gains (11.6/11.1 pp on OSWorld) depend on the reference agent correctly surfacing the student's failure modes to generate targeted tasks. The manuscript provides no independent verification of this step (e.g., human audit of identified weaknesses, inter-annotator agreement, or ablation on reference-agent quality), leaving the claimed advantage over naive trajectory generation unconfirmed.

    Authors: The manuscript already provides indirect but quantitative evidence via the direct comparison to the naive trajectory generation baseline, which employs the identical reference agent and the same eight OSWorld domains yet produces only marginal gains; the performance delta is therefore attributable to the student-aware weakness detection and targeted synthesis steps. We nevertheless agree that explicit verification would increase confidence. In the revised version we will add (i) qualitative examples of weaknesses surfaced by the reference agent, (ii) an ablation that substitutes a weaker reference agent, and (iii) a human audit on a random sample of 50 identified weaknesses together with agreement statistics. These additions will directly address the concern while preserving the existing empirical comparison. revision: yes

  2. Referee: [Abstract] The observation that 'naive approach yields only marginal improvements' is central to motivating LearnWeak, yet the paper does not detail the experimental setup, domains, or error analysis used to establish this baseline result, making it difficult to assess whether the student-aware pipeline's gains are incremental or transformative.

    Authors: The experimental protocol for the naive baseline is presented in Section 4.2: an equal number of trajectories are generated by the reference agent on randomly sampled tasks drawn from the identical eight OSWorld domains used for LearnWeak, without any weakness detection. Section 5.3 further decomposes the resulting error reductions into planning versus execution categories, showing that the naive method improves execution but leaves planning errors largely unchanged. To improve clarity we will expand Section 4.2 into a dedicated subsection that reports exact trajectory counts, sampling procedure, and additional comparative tables, making the baseline fully reproducible. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical specialization method (LearnWeak) that relies on an external stronger reference agent to identify student weaknesses and synthesize tasks, followed by an error-aware training objective, with results reported as measured gains on the independent external benchmark OSWorld. No self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation; the central claims rest on observable performance differences rather than inputs that are equivalent by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract does not specify any free parameters, axioms, or invented entities; the framework introduces a new objective but its mathematical form and any hyperparameters are not provided.

pith-pipeline@v0.9.1-grok · 5754 in / 1150 out tokens · 55576 ms · 2026-06-29T14:12:59.705276+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

67 extracted references · 36 canonical work pages · 9 internal anchors

  1. [1]

    Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

    Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent s2: A compositional generalist-specialist framework for computer use agents, 2025. URL https://arxiv.org/abs/2504.00906

  2. [2]

    Claude sonnet 4.6 system card, February 2026

    Anthropic. Claude sonnet 4.6 system card, February 2026. URL https://anthropic.com/ claude-sonnet-4-6-system-card

  3. [3]

    Fara-7b: An efficient agentic model for computer use, 2025

    Ahmed Awadallah, Yash Lara, Raghav Magazine, Hussein Mozannar, Akshay Nambi, Yash Pandya, Aravind Rajeswaran, Corby Rosset, Alexey Taymanov, Vibhav Vineet, Spencer White- head, and Andrew Zhao. Fara-7b: An efficient agentic model for computer use, 2025. URL https://arxiv.org/abs/2511.19663

  4. [4]

    Windows agent arena: Evaluating multi-modal os agents at scale.arXiv preprint arXiv:2409.08264, 2024

    Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, et al. Windows agent arena: Evaluating multi-modal os agents at scale.arXiv preprint arXiv:2409.08264, 2024

  5. [5]

    RISK: A Framework for GUI Agents in E-commerce Risk Management

    Renqi Chen, Zeyin Tao, Jianming Guo, Jingzhe Zhu, Yiheng Peng, Qingqing Sun, Tianyi Zhang, and Shuai Chen. Risk: A framework for gui agents in e-commerce risk management.arXiv preprint arXiv:2509.21982, 2025

  6. [6]

    We- boperator: Action-aware tree search for autonomous agents in web environment, 2025

    Mahir Labib Dihan, Tanzima Hashem, Mohammed Eunus Ali, and Md Rizwan Parvez. We- boperator: Action-aware tree search for autonomous agents in web environment, 2025. URL https://arxiv.org/abs/2512.12692. 10

  7. [7]

    TinyAgent: Function calling at the edge

    Lutfi Eren Erdogan, Nicholas Lee, Siddharth Jha, Sehoon Kim, Ryan Tabrizi, Suhong Moon, Coleman Richard Charles Hooper, Gopala Anumanchipalli, Kurt Keutzer, and Amir Gholami. TinyAgent: Function calling at the edge. In Delia Irazu Hernandez Farias, Tom Hope, and Manling Li, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Languag...

  8. [8]

    doi: 10.18653/v1/2024.emnlp-demo.9

    Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-demo.9. URL https://aclanthology.org/2024.emnlp-demo.9/

  9. [9]

    Navigating the digital world as humans do: Universal visual grounding for GUI agents

    Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=kxnoqaisCT

  10. [10]

    Efficient agent training for computer use, 2025

    Yanheng He, Jiahe Jin, and Pengfei Liu. Efficient agent training for computer use, 2025. URL https://arxiv.org/abs/2505.13909

  11. [11]

    Scalable data synthesis for computer use agents with step-level filtering.arXiv preprint arXiv:2512.10962, 2025

    Yifei He, Pranit Chawla, Yaser Souri, Subhojit Som, and Xia Song. Scalable data synthesis for computer use agents with step-level filtering.arXiv preprint arXiv:2512.10962, 2025

  12. [12]

    Lora: Low-rank adaptation of large language models.International Conference on Learning Representations, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.International Conference on Learning Representations, 1(2):3, 2022

  13. [13]

    Mitigating catastrophic forgetting in large language models with forgetting-aware pruning

    Wei Huang, Anda Cheng, and Yinggui Wang. Mitigating catastrophic forgetting in large language models with forgetting-aware pruning. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4-9, 2025,...

  14. [14]

    Unlocking the power of function vectors for characterizing and mitigating catastrophic forgetting in continual instruction tuning

    Gangwei Jiang, Caigao Jiang, Zhaoyi Li, Siqiao Xue, Jun Zhou, Linqi Song, Defu Lian, and Ying Wei. Unlocking the power of function vectors for characterizing and mitigating catastrophic forgetting in continual instruction tuning. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net,...

  15. [15]

    Planning and acting in partially observable stochastic domains.Artificial intelligence, 101(1-2):99–134, 1998

    Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains.Artificial intelligence, 101(1-2):99–134, 1998

  16. [16]

    Imitation learning for multi-turn lm agents via on-policy expert corrections.arXiv preprint arXiv:2512.14895, 2025

    Niklas Lauffer, Xiang Deng, Srivatsa Kundurthy, Brad Kenstler, and Jeff Da. Imitation learning for multi-turn lm agents via on-policy expert corrections.arXiv preprint arXiv:2512.14895, 2025

  17. [17]

    Screenspot-pro: GUI grounding for professional high-resolution computer use

    Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: GUI grounding for professional high-resolution computer use. In Cathal Gurrin, Klaus Schoeffmann, Min Zhang, Luca Rossetto, Stevan Rudinac, Duc-Tien Dang-Nguyen, Wen-Huang Cheng, Phoebe Chen, and Jenny Benois-Pineau, editors, Proceedin...

  18. [18]

    On the effects of data scale on ui control agents.Advances in Neural Information Processing Systems, 37:92130–92154, 2024

    Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. On the effects of data scale on ui control agents.Advances in Neural Information Processing Systems, 37:92130–92154, 2024

  19. [19]

    Showui: One vision-language- action model for GUI visual agent

    Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Stan Weixian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language- action model for GUI visual agent. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 19498– 19508. Computer Vision Foundatio...

  20. [20]

    Osexpert: Computer-use agents learning professional skills via exploration.arXiv preprint arXiv:2603.07978, 2026

    Jiateng Liu, Zhenhailong Wang, Rushi Wang, Bingxuan Li, Jeonghwan Kim, Aditi Tiwari, Pengfei Yu, Denghui Zhang, and Heng Ji. Osexpert: Computer-use agents learning professional skills via exploration.arXiv preprint arXiv:2603.07978, 2026

  21. [21]

    Continual gui agents.arXiv preprint arXiv:2601.20732, 2026

    Ziwei Liu, Borui Kang, Hangjie Yuan, Zixiang Zhao, Wei Li, Yifan Zhu, and Tao Feng. Continual gui agents.arXiv preprint arXiv:2601.20732, 2026

  22. [22]

    From correction to mastery: Reinforced distillation of large language model agents.arXiv preprint arXiv:2509.14257, 2025

    Yuanjie Lyu, Chengyu Wang, Jun Huang, and Tong Xu. From correction to mastery: Reinforced distillation of large language model agents.arXiv preprint arXiv:2509.14257, 2025

  23. [23]

    Pptarena: A benchmark for agentic powerpoint editing, 2025

    Michael Ofengenden, Yunze Man, Ziqi Pang, and Yu-Xiong Wang. Pptarena: A benchmark for agentic powerpoint editing, 2025. URLhttps://arxiv.org/abs/2512.03042

  24. [24]

    Introducing gpt -5.4, March 2026

    OpenAI. Introducing gpt -5.4, March 2026. URL https://openai.com/index/ introducing-gpt-5-4/

  25. [25]

    Introducing gpt-5.4 mini and nano, March 2026

    OpenAI. Introducing gpt-5.4 mini and nano, March 2026. URL https://openai.com/ index/introducing-gpt-5-4-mini-and-nano/

  26. [26]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

  27. [27]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  28. [28]

    A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning

    Stephane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning, 2011. URL https://arxiv.org/abs/ 1011.0686

  29. [29]

    Watch and learn: Learning to use computers from online videos.arXiv preprint arXiv:2510.04673, 2025

    Chan Hee Song, Yiwen Song, Palash Goyal, Yu Su, Oriana Riva, Hamid Palangi, and Tomas Pfister. Watch and learn: Learning to use computers from online videos.arXiv preprint arXiv:2510.04673, 2025

  30. [30]

    Trial and error: Exploration-based trajectory optimization for LLM agents.arXiv preprint arXiv:2403.02502, 2024

    Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, and Bill Yuchen Lin. Trial and error: Exploration-based trajectory optimization for llm agents, 2024. URL https://arxiv.org/ abs/2403.02502

  31. [31]

    Os-genesis: Automating GUI agent trajectory construction via reverse task synthesis

    Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, Ben Kao, Guohao Li, Junxian He, Yu Qiao, and Zhiyong Wu. Os-genesis: Automating GUI agent trajectory construction via reverse task synthesis. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, edit...

  32. [32]

    Coda: Coordinating the cerebrum and cerebellum for a dual-brain computer use agent with decoupled reinforcement learning.arXiv preprint arXiv:2508.20096, 2025

    Zeyi Sun, Yuhang Cao, Jianze Liang, Qiushi Sun, Ziyu Liu, Zhixiong Zhang, Yuhang Zang, Xiaoyi Dong, Kai Chen, Dahua Lin, et al. Coda: Coordinating the cerebrum and cerebellum for a dual-brain computer use agent with decoupled reinforcement learning.arXiv preprint arXiv:2508.20096, 2025

  33. [33]

    Seagent: Self-evolving computer use agent with autonomous learning from experience

    Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, and Jiaqi Wang. Seagent: Self-evolving computer use agent with autonomous learning from experience. arXiv preprint arXiv:2508.04700, 2025

  34. [34]

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

  35. [35]

    Qwen Team. Qwen3. 5-omni technical report.arXiv preprint arXiv:2604.15804, 2026. 12

  36. [36]

    UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxi- ang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning.arXiv preprint arXiv:2509.02544, 2025

  37. [37]

    Opencua: Open foundations for computer-use agents.arXiv preprint arXiv:2508.09123, 2025

    Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, et al. Opencua: Open foundations for computer-use agents.arXiv preprint arXiv:2508.09123, 2025

  38. [38]

    OS-ATLAS: foundation action model for generalist GUI agents

    Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, and Yu Qiao. OS-ATLAS: foundation action model for generalist GUI agents. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview...

  39. [39]

    Agentsynth: Scalable task generation for generalist computer-use agents.arXiv preprint arXiv:2506.14205, 2025

    Jingxu Xie, Dylan Xu, Xuandong Zhao, and Dawn Song. Agentsynth: Scalable task generation for generalist computer-use agents.arXiv preprint arXiv:2506.14205, 2025

  40. [40]

    Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

  41. [41]

    Mobile-agent-v3

    Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, et al. Mobile-agent-v3. 5: Multi-platform fundamental gui agents.arXiv preprint arXiv:2602.16855, 2026

  42. [42]

    Agenttrek: Agent trajectory synthesis via guiding replay with web tutorials, 2025

    Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, and Tao Yu. Agenttrek: Agent trajectory synthesis via guiding replay with web tutorials, 2025. URLhttps://arxiv.org/abs/2412.09605

  43. [43]

    Evocua: Evolving computer use agents via learning from scalable synthetic experience.arXiv preprint arXiv:2601.15876, 2026

    Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, et al. Evocua: Evolving computer use agents via learning from scalable synthetic experience.arXiv preprint arXiv:2601.15876, 2026

  44. [44]

    Zerogui: Automating online gui learning at zero human cost.arXiv preprint arXiv:2505.23762, 2025

    Chenyu Yang, Su Shiqian, Shi Liu, Xuan Dong, Yue Yu, Weijie Su, Xuehui Wang, Zhaoyang Liu, Jinguo Zhu, Hao Li, Wenhai Wang, Yu Qiao, Xizhou Zhu, and Jifeng Dai. Zerogui: Automating online gui learning at zero human cost.arXiv preprint arXiv:2505.23762, 2025

  45. [45]

    macosworld: A multilingual interactive benchmark for gui agents.arXiv preprint arXiv:2506.04135, 2025

    Pei Yang, Hai Ci, and Mike Zheng Shou. macosworld: A multilingual interactive benchmark for gui agents.arXiv preprint arXiv:2506.04135, 2025

  46. [46]

    Ferret-ui lite: Lessons from building small on-device gui agents.arXiv preprint arXiv:2509.26539, 2025

    Zhen Yang, Zi-Yi Dou, Di Feng, Forrest Huang, Anh Nguyen, Keen You, Omar Attia, Yuhao Yang, Michael Feng, Haotian Zhang, et al. Ferret-ui lite: Lessons from building small on-device gui agents.arXiv preprint arXiv:2509.26539, 2025

  47. [47]

    WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point

    Henry Hengyuan Zhao, Kaiming Yang, Wendi Yu, Difei Gao, and Mike Zheng Shou. Worldgui: An interactive benchmark for desktop gui automation from any starting point, 2026. URL https://arxiv.org/abs/2502.08047

  48. [48]

    Agentdam: Privacy leakage evaluation for autonomous web agents.arXiv preprint arXiv:2503.09780, 2025

    Arman Zharmagambetov, Chuan Guo, Ivan Evtimov, Maya Pavlova, Ruslan Salakhutdinov, and Kamalika Chaudhuri. Agentdam: Privacy leakage evaluation for autonomous web agents.arXiv preprint arXiv:2503.09780, 2025

  49. [49]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents, 2024. URL https://arxiv.org/abs/ 2307.13854. 13 Appendix Overview This appendix provides supplementary material for the main pape...

  50. [50]

    All tasks in the instruction should be completed to get the pass

    Analyze the task instruction and set the criteria for task completion. All tasks in the instruction should be completed to get the pass

  51. [51]

    Decide if the agent correctly completed the task objective (pass/fail)

  52. [52]

    task_completion_criteria

    If fail, provide a SHORT reason (3-4 sentences), concrete and behavior-focused. This should be detailed enough to help the agent improve without seeing the trajectory. Include which sub-task it failed, which component it did not ground correctly, or why the progress got stuck. Return STRICT JSON only, with this exact schema: { "task_completion_criteria": ...

  53. [53]

    Focus on sub-tasks the agent cannot do reliably

  54. [54]

    Identify concrete operations the agent misuses or fails to execute

  55. [55]

    Categories should be notably different from each other

  56. [56]

    Group repeated failures into reusable categories

  57. [57]

    Figure 6: Teacher–student weakness summarization prompt

    Do not include markdown; return JSON only. Figure 6: Teacher–student weakness summarization prompt. You are evaluating screenshots from a single software domain. Goal: select the screenshots that maximize understanding of the domain’s features and UI components. You will receive candidate screenshots in this pattern: Image 0: <image> Image 1: <image> ... ...

  58. [58]

    Coverage of distinct major features/workflows

  59. [59]

    Diversity of visible UI components/layout states

  60. [60]

    selected_indices

    Informational richness (settings/panels/dialogs/menus/output views). Avoid near-duplicates and low-information transitional frames. Return ONLY valid JSON with this schema: { "selected_indices": [int, ...], "reasons": [ { "index": int, "reason": "short reason focused on coverage value" } ] } Figure 7: Screenshot ranking prompt. 19 Goal: - Propose new task...

  61. [62]

    Student weakness analysis (teacher pass, student fail)

  62. [65]

    queries": [ {

    Extra file/folder/code context from this config (provide_info) Requirements: - Generate exactly Y instructions. - Each instruction must be concise end-user style English. - Do not include more than two simple and easy sub-tasks. - Every instruction must satisfy the workspace/path contract. - Must target one or more weak abilities from the analysis. - Must...

  63. [66]

    Prior instructions already used (avoid overlap/paraphrase)

  64. [67]

    Student weakness analysis: (Not used in this run.)

  65. [68]

    Workspace / path contract

  66. [69]

    Current docker config array to target

  67. [70]

    queries": [ {

    Extra file/folder/code context from this config (provide_info) Requirements: - Generate exactly Y instructions. - Each instruction must be concise end-user style English. - Do not include more than two simple and easy sub-tasks. - Every instruction must satisfy the workspace/path contract. - Must maximize diversity and novelty versus prior instructions. -...